CN105894032A - Method of extracting effective features based on sample properties - Google Patents
Method of extracting effective features based on sample properties
- Publication number
- CN105894032A (application CN201610202600.5A)
- Authority
- CN
- China
- Prior art keywords
- sample
- feature
- features
- extracted
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method of extracting effective features based on sample properties, comprising a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step. At the start of classification, an initial feature set is set; for each sample to be classified, the feature set to be extracted next is determined from the currently extracted feature set, and it is then judged whether to stop extracting features. If further features are needed, the previous step is repeated; if feature extraction is stopped, the sample is input into an appropriate classifier and a prediction result is obtained. Compared with the prior art, the method fully considers both the time cost of sample feature extraction and the classification confidence.
Description
Technical Field
The invention relates to a technique for extracting effective features from samples in pattern recognition, and is particularly suited to problems in which the feature extraction cost and the reliability of the classification result must be considered at the same time.
Background
With the rapid development of the internet and of portable internet devices, the network has become an important part of people's lives and an important carrier for the propagation and development of human civilization; more and more data is propagated through the network. To meet people's different requirements on information forms, text, sound, images and other information are usually integrated, which results in complex and heterogeneous data formats on the network. Nowadays, ever more complex media data is produced and spread in large quantities, and we face the problem of how to efficiently retrieve and classify this large volume of complex data. It is therefore desirable to find an efficient and useful way of extracting features in order to process this vast amount of information.
Currently, there are many online machine learning methods, such as online clustering and online classification; they all accelerate the learning process by sampling or optimization strategies. However, these methods assume that the feature extraction overhead can be ignored; that is, the cost of turning raw data into valid features is not taken into account. In fact, extracting valid features from raw data is a non-trivial overhead in the operation of the whole classification system, and as data forms become more and more complex, feature extraction accounts for an ever larger proportion of the total system cost. How to extract useful features efficiently is a problem that needs to be solved.
In medical diagnostic systems there is a series of tests, such as body temperature measurement, blood tests and blood pressure measurement. A doctor does not run every test before making a diagnosis, which would be too costly; instead, a preliminary examination is carried out first, and its result determines whether a further examination is needed and, if so, which one; if not, a diagnostic conclusion is reached. Inspired by this idea, for each sample we hope to extract only the set of features that is most effective for classifying that particular sample, rather than extracting all features, thereby reducing the feature extraction overhead.
Summary of the Invention
Purpose of the invention: in the prior art, many machine learning algorithms consider how to improve learning efficiency from the perspective of sampling or optimization, but few consider the feature extraction overhead of samples, and this overhead grows as data forms become more and more complex. To address this problem, the invention provides a method for extracting effective features according to sample properties: for samples that are easy to classify, only simple features, i.e. features with small overhead, are extracted; for samples that are difficult to classify, some complex features are extracted in addition to the simple ones to help classify them.
The technical scheme is as follows: a method for extracting effective features according to sample properties sets an initial feature set at the start; for each sample to be classified, the feature set to be extracted next is determined from the currently extracted feature set, and it is then judged whether to stop extracting features. If further features are needed, the previous step is repeated; if feature extraction stops, the sample is input into an appropriate classifier to obtain a prediction result. The method specifically comprises a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step;
the specific steps of training sample feature serialization are as follows (an illustrative Python sketch follows the list):
step 100: label the training sample data and obtain all features together with the time overhead of each feature;
step 101: calculate the Euclidean distances between pairs of training samples from the obtained features;
step 102: find the neighbor set of each training sample according to the pairwise distances and the preset number of neighbors;
step 103: within each training sample's neighbor set, calculate the weight of every feature, i.e. how useful each group of features is for classifying the sample;
step 104: sort the features; the larger the weight, the larger the feature's contribution to classification and the earlier it should be extracted.
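As an illustration only, the serialization phase (steps 100-104) can be sketched in Python as follows. The per-feature weighting is a simple Relief-style surrogate for the optimization of step 103, whose exact formula is given further below, and all function and variable names are hypothetical rather than part of the claimed method.

```python
# Minimal sketch of training-sample feature serialization (steps 100-104).
# The weight update is a Relief-style stand-in, not the patent's exact objective.
import numpy as np

def serialize_features(X, y, k=5):
    """X: (n_samples, n_features) fully extracted training data, y: labels.
    Returns, for each sample, its feature indices ordered by decreasing weight."""
    n, d = X.shape
    # Step 101: pairwise Euclidean distances between training samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    # Step 102: the k nearest neighbors of each sample (ascending distance).
    neighbors = np.argsort(dist, axis=1)[:, :k]
    orders = np.empty((n, d), dtype=int)
    for i in range(n):
        # Step 103: per-feature weights within the sample's neighbor set --
        # reward features that separate different-class neighbors and that
        # stay close for same-class neighbors.
        w = np.zeros(d)
        for j in neighbors[i]:
            r = 1.0 if y[i] == y[j] else -1.0
            w += -r * np.abs(X[i] - X[j])
        # Step 104: largest weight (most useful) first.
        orders[i] = np.argsort(-w)
    return orders
```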
The specific steps of training the sample feature selector and the corresponding models are as follows (an illustrative sketch follows the list):
step 200: after the training data have been serialized, split them into pairs of the form (currently extracted feature set, feature set to be extracted next), giving a group of feature set pairs;
step 201: from the split feature set pairs, train a feature selector G based on the currently extracted features, and train classifiers for the different feature combinations.
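Under strong simplifying assumptions, steps 200-201 can be sketched as follows: the features are assumed to be partitioned into fixed groups with group 0 as the initial set, a multinomial logistic regression stands in for the selector G, and plain logistic regressions stand in for the classifiers C_s, whose actual formulations are given further below. All names are hypothetical.

```python
# Minimal sketch of steps 200-201: build (existing feature set, next feature
# group) training pairs from the serialized samples, then fit a stand-in
# next-group selector and one classifier per feature combination reached.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_selector_and_models(X, y, orders, groups):
    """orders: per-sample serializations from serialize_features;
    groups: list of index arrays partitioning the feature columns."""
    n, d = X.shape
    group_of = np.empty(d, dtype=int)
    for g, idx in enumerate(groups):
        group_of[idx] = g
    pair_inputs, pair_targets = [], []
    models = {frozenset({0}): LogisticRegression(max_iter=1000).fit(X[:, list(groups[0])], y)}
    extracted = [{0} for _ in range(n)]                 # group 0 = initial feature set
    for _ in range(len(groups) - 1):
        for i in range(n):
            cols = sorted(c for g in extracted[i] for c in groups[g])
            masked = np.zeros(d)
            masked[cols] = X[i, cols]                   # values of features seen so far
            # Step 200: the target is the group holding the sample's
            # best not-yet-extracted feature (per its serialization).
            nxt = next(group_of[f] for f in orders[i] if group_of[f] not in extracted[i])
            pair_inputs.append(masked)
            pair_targets.append(nxt)
            extracted[i].add(nxt)
        # Step 201: a classifier for every feature combination reached so far.
        for combo in {frozenset(e) for e in extracted}:
            if combo not in models:
                cols = sorted(c for g in combo for c in groups[g])
                models[combo] = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
    # Step 201: the selector, trained on the pooled pairs (assumes the pooled
    # targets contain at least two distinct groups).
    selector = LogisticRegression(max_iter=1000).fit(np.array(pair_inputs), pair_targets)
    return selector, models
```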
The specific steps of model classification for a sample are as follows (an illustrative sketch follows the list):
step 300: extract the initial feature set from the test sample;
step 301: judge, according to the evaluation index, whether a further feature set needs to be extracted; if so, jump to step 302, otherwise jump to step 303;
step 302: according to the currently extracted feature set and the feature selector G, determine the feature set to be extracted next, merge it with the existing feature set, and jump to step 301;
step 303: look up the trained classification model corresponding to the current feature set and use it to classify the sample.
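An illustrative classification loop for steps 300-303, reusing the hypothetical helpers above, might look as follows; the stopping rule combines a confidence threshold with a cap on the number of extracted groups as a stand-in for the patent's evaluation index (time threshold and accuracy requirement), and extract_group is an assumed user-supplied callable that computes one feature group of a raw sample.

```python
# Minimal sketch of per-sample classification (steps 300-303).
import numpy as np

def classify_with_budget(x, selector, models, groups, extract_group,
                         conf_threshold=0.9, max_groups=4):
    d = sum(len(g) for g in groups)
    combo, values = {0}, np.zeros(d)
    values[list(groups[0])] = extract_group(x, 0)        # step 300: initial features
    while True:
        cols = sorted(c for g in combo for c in groups[g])
        model = models.get(frozenset(combo))             # step 303: model lookup
        if model is not None:
            proba = model.predict_proba(values[cols].reshape(1, -1))[0]
            # Step 301: stop when confident enough or the extraction budget is spent.
            if proba.max() >= conf_threshold or len(combo) >= min(max_groups, len(groups)):
                return model.classes_[proba.argmax()]
        # Step 302: ask the selector which feature group to extract next;
        # if no model exists for this combination yet, keep extracting.
        nxt = int(selector.predict(values.reshape(1, -1))[0])
        if nxt in combo:                                  # fall back to any unused group
            nxt = next(g for g in range(len(groups)) if g not in combo)
        values[list(groups[nxt])] = extract_group(x, nxt)
        combo.add(nxt)
```

In this sketch, easy samples typically stop after the initial group while harder samples trigger further extraction, which is the behaviour the invention aims for.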
The specific method for finding the training sample neighbor set in step 102 is as follows: sort the calculated Euclidean distances in ascending order and, given the preset number of neighbors k, select the samples corresponding to the first k distances.
The weights of the training sample features in step 103 are calculated by minimizing the sum of the weighted variances between the training sample and each of its neighbors; the specific formula is as follows:
where X_i denotes the i-th feature of the sample, X_j denotes the j-th feature, D_i(X_j) denotes the weighted distance between X_i and X_j, and u_i denotes the weight of the i-th feature of the sample; the neighbor set of the i-th sample is the sample set formed by its k neighbors; y_i and y_j denote the labels of the i-th and j-th samples, with r_ij = 1 if y_i = y_j and r_ij = -1 otherwise; c_i and λ are preset parameters, where c_i denotes the upper bound on the distance between samples of the same class and λ is a regularization parameter.
The specific formula of the feature selector G in step 201 is as follows:
where x_l denotes the features already extracted in the previous l rounds, l denotes the l-th extraction, c denotes the feature to be extracted next, the set appearing in the formula denotes the extracted feature set, f is a function of the features, and w denotes the linear coefficients;
the function f of the features is expressed as:
f(x_l, c) = x_l 1^T C    (4)
where 1^T is a 1 × m vector whose elements are all 1, m being the number of feature groups to be extracted; C denotes a diagonal matrix whose k-th main-diagonal element is C_kk, with C_kk = 1 when c = k and C_kk = -1 otherwise.
The linear coefficient w is expressed as:
where the first term indicates that the i-th sample has already had l groups of features extracted, c_{l+1} denotes the feature to be extracted for the i-th sample at step l+1, and the next term denotes the candidate features other than the one to be extracted at step l+1; Δ is defined by Δ(c_i, c_i) = 0 and Δ(c_i, c_j) = 1 for i ≠ j; the last term is the slack variable and α is the regularization parameter.
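A small numpy illustration of the feature function of equation (4), f(x_l, c) = x_l 1^T C, together with an assumed argmax rule for turning the linear scores w · f(x_l, c) into a choice of the next feature group; since the selector's full objective is not reproduced here, the selection rule is an assumption and all names are hypothetical.

```python
# Illustration of equation (4) and of scoring candidate feature groups.
import numpy as np

def feature_function(x_l, c, m):
    """x_l: values of the already extracted features (length p);
    c: index of the candidate feature group; m: number of feature groups."""
    ones = np.ones((1, m))                    # 1^T, a 1 x m all-ones vector
    C = -np.eye(m)
    C[c, c] = 1.0                             # diagonal C with C_kk = 1 iff c = k
    return np.outer(x_l, ones) @ C            # the p x m matrix x_l 1^T C

def select_next(x_l, w, candidates, m):
    """Assumed rule: pick the candidate c maximizing the linear score w . f(x_l, c)."""
    scores = [float(w.flatten() @ feature_function(x_l, c, m).flatten())
              for c in candidates]
    return candidates[int(np.argmax(scores))]
```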
The classifier C_s in step 201 is given by the following formula:
where x_s denotes the extracted features, y denotes the label of the sample, Z denotes the label space, i.e. the set of all labels, f is a function of the features, and V is obtained by solving the following optimization problem:
where the first term denotes the extracted features of the i-th sample, y_i denotes the label of the i-th sample, and the next term denotes the labels other than y_i; Δ is defined analogously to before, with Δ(y_i, y_i) = 0; the last term is the slack variable.
The evaluation index in step 301 includes a time threshold for feature extraction and a classification accuracy requirement for the classifier.
Beneficial effects: compared with the prior art, the method fully considers the time overhead of sample feature extraction and the classification confidence. By exploiting the properties of each sample, the method extracts the features that are most useful for classifying that sample: for simple samples only some basic features are extracted, while for complex samples more features are extracted. Because different feature sets contribute to different degrees for the same sample, the invention extracts the features that are most beneficial to classification, which helps improve the classification accuracy.
Drawings
FIG. 1 is a flow chart of the operation of the training sample feature serialization phase of the present invention;
FIG. 2 is a flowchart illustrating the operation of the sample feature selector and corresponding model training phase of the present invention;
FIG. 3 is a workflow diagram of the model classification phase for a sample of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and are not intended to limit the scope of the invention; various equivalent modifications of the invention that occur to those skilled in the art upon reading the present disclosure likewise fall within the scope of the appended claims.
The workflow of the training sample feature serialization phase is shown in fig. 1. This phase requires a certain amount of labeled training data with all features available. In actual use, a company can label a batch of data and obtain all of its features together with the time overhead of each feature (step 10); then, the Euclidean distances between the training samples are calculated from the features of the data (step 11); for each sample, the corresponding number of neighbors is selected according to the preset neighbor count (step 12); next, the weight of each feature of each training sample is calculated (step 13); finally, the features are sorted by these weights, with larger weights ranked earlier. This yields the serialized training samples.
The workflow of the sample feature selector and corresponding model training phase is shown in fig. 2. First, the training samples serialized in the previous phase are split to obtain a set of feature set pairs, each consisting of the currently extracted feature set and the feature set to be extracted next (step 15); then, based on these feature set pairs, a feature selector G is trained, and at the same time corresponding classification models C_s are trained for the feature combinations of the training samples (step 16).
The model classification workflow for a sample is shown in fig. 3. First, the initial feature set is extracted from the test sample (step 18); then, it is judged whether the already extracted features satisfy the condition for stopping feature extraction, which can be a time threshold for feature extraction or an accuracy the classifier can reach, chosen according to practical requirements (step 19); if the stopping condition is met, the matching model is selected directly for classification to obtain the classification result (step 20a); otherwise, the feature to be extracted next is selected according to the feature selector, and the process returns to step 19 (step 20b). An end-to-end illustration of this workflow is sketched below.
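For illustration, the three phases of figs. 1-3 can be chained end to end with the hypothetical helpers sketched above; the synthetic data, the fixed group partition, and the extract_group stand-in are not taken from the patent.

```python
# End-to-end illustration using the hypothetical helpers sketched earlier.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6), np.arange(6, 8)]

orders = serialize_features(X, y, k=5)                               # fig. 1 phase
selector, models = train_selector_and_models(X, y, orders, groups)   # fig. 2 phase

def extract_group(raw_sample, g):
    # Stand-in for real, possibly expensive, feature extraction: in practice
    # each group would be computed from the raw sample only when requested.
    return raw_sample[groups[g]]

pred = classify_with_budget(X[0], selector, models, groups, extract_group,
                            conf_threshold=0.9, max_groups=4)        # fig. 3 phase
print(pred)
```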
Claims (6)
1. A method of extracting effective features according to sample properties, comprising: training sample feature serialization, training a sample feature selector and corresponding models, and performing model classification for the sample;
the specific steps of training sample feature serialization are as follows:
step 100: label the training sample data and obtain all features together with the time overhead of each feature;
step 101: calculate the Euclidean distances between pairs of training samples from the obtained features;
step 102: find the neighbor set of each training sample according to the pairwise distances and the preset number of neighbors;
step 103: within each training sample's neighbor set, calculate the weight of every feature, i.e. how useful each group of features is for classifying the sample;
step 104: sort the features; the larger the weight, the larger the feature's contribution to classification and the earlier it should be extracted;
the specific steps of training the sample feature selector and the corresponding models are as follows:
step 200: after the training data have been serialized, split them into pairs of the form (currently extracted feature set, feature set to be extracted next), giving a group of feature set pairs;
step 201: from the split feature set pairs, train a feature selector G based on the currently extracted features, and train classifiers for the different feature combinations;
the specific steps of model classification are as follows:
step 300: extract the initial feature set from the test sample;
step 301: judge, according to the evaluation index, whether a further feature set needs to be extracted; if so, jump to step 302, otherwise jump to step 303;
step 302: according to the currently extracted feature set and the feature selector G, determine the feature set to be extracted next, merge it with the existing feature set, and jump to step 301;
step 303: look up the trained classification model corresponding to the current feature set and use it to classify the sample.
2. The method of extracting effective features according to sample properties of claim 1, wherein the specific method for finding the training sample neighbor set in step 102 is as follows: sort the calculated Euclidean distances in ascending order and, given the preset number of neighbors k, select the samples corresponding to the first k distances.
3. The method of extracting effective features according to sample properties of claim 1, wherein the weights of the training sample features in step 103 are calculated by minimizing the sum of the weighted variances between the training sample and each of its neighbors, with the specific formula as follows:
where X_i denotes the i-th feature of the sample, X_j denotes the j-th feature, D_i(X_j) denotes the weighted distance between X_i and X_j, and u_i denotes the weight of the i-th feature of the sample; the neighbor set of the i-th sample is the sample set formed by its k neighbors; y_i and y_j denote the labels of the i-th and j-th samples, with r_ij = 1 if y_i = y_j and r_ij = -1 otherwise; c_i and λ are preset parameters, where c_i denotes the upper bound on the distance between samples of the same class and λ is a regularization parameter.
4. The method of extracting effective features according to sample properties of claim 1, wherein the specific formula of the feature selector G in step 201 is as follows:
where x_l denotes the features already extracted in the previous l rounds, l denotes the l-th extraction, c denotes the feature to be extracted next, the set appearing in the formula denotes the extracted feature set, f is a function of the features, and w denotes the linear coefficients;
the function f of the features is expressed as:
f(x_l, c) = x_l 1^T C    (4)
where 1^T is a 1 × m vector whose elements are all 1, m being the number of feature groups to be extracted; C denotes a diagonal matrix whose k-th main-diagonal element is C_kk, with C_kk = 1 when c = k and C_kk = -1 otherwise.
The linear coefficient w is expressed as:
where the first term indicates that the i-th sample has already had l groups of features extracted, c_{l+1} denotes the feature to be extracted for the i-th sample at step l+1, and the next term denotes the candidate features other than the one to be extracted at step l+1; Δ is defined by Δ(c_i, c_i) = 0 and Δ(c_i, c_j) = 1 for i ≠ j; the last term is the slack variable and α is the regularization parameter.
5. The method of extracting effective features according to sample properties of claim 1, wherein the classifier C_s in step 201 is given by the following formula:
where x_s denotes the extracted features, y denotes the label of the sample, Z denotes the label space, i.e. the set of all labels, f is a function of the features, and V is obtained by solving the following optimization problem:
where the first term denotes the extracted features of the i-th sample, y_i denotes the label of the i-th sample, and the next term denotes the labels other than y_i; Δ is defined analogously to before, with Δ(y_i, y_i) = 0; the last term is the slack variable.
6. The method of extracting effective features according to sample properties of claim 1, wherein the evaluation index in step 301 includes a time threshold for feature extraction and a classification accuracy requirement for the classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610202600.5A CN105894032A (en) | 2016-04-01 | 2016-04-01 | Method of extracting effective features based on sample properties |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610202600.5A CN105894032A (en) | 2016-04-01 | 2016-04-01 | Method of extracting effective features based on sample properties |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105894032A true CN105894032A (en) | 2016-08-24 |
Family
ID=57012118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610202600.5A Pending CN105894032A (en) | 2016-04-01 | 2016-04-01 | Method of extracting effective features based on sample properties |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105894032A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845537A (en) * | 2017-01-09 | 2017-06-13 | 北京邮电大学 | A kind of grader radius based on adaptive threshold determines method and device |
CN107358209A (en) * | 2017-07-17 | 2017-11-17 | 成都通甲优博科技有限责任公司 | Training method, device and method for detecting human face, the device of Face datection model |
CN107545274A (en) * | 2017-07-18 | 2018-01-05 | 北京建筑大学 | Semi-supervised label ratio learning method |
CN111314691A (en) * | 2018-12-11 | 2020-06-19 | 中国移动通信集团广东有限公司 | Video call quality assessment method and device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095902A (en) * | 2014-05-23 | 2015-11-25 | 华为技术有限公司 | Method and apparatus for extracting image features |
- 2016-04-01: application CN201610202600.5A filed in CN; published as CN105894032A (en); status Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095902A (en) * | 2014-05-23 | 2015-11-25 | 华为技术有限公司 | Method and apparatus for extracting image features |
Non-Patent Citations (1)
Title |
---|
LI-PING LIU et al.: "TEFE: A Time-Efficient Approach to Feature Extraction", in Proceedings of the 8th IEEE International Conference on Data Mining *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845537A (en) * | 2017-01-09 | 2017-06-13 | 北京邮电大学 | A kind of grader radius based on adaptive threshold determines method and device |
CN106845537B (en) * | 2017-01-09 | 2020-12-04 | 北京邮电大学 | Classifier radius determination method and device based on self-adaptive threshold |
CN107358209A (en) * | 2017-07-17 | 2017-11-17 | 成都通甲优博科技有限责任公司 | Training method, device and method for detecting human face, the device of Face datection model |
CN107545274A (en) * | 2017-07-18 | 2018-01-05 | 北京建筑大学 | Semi-supervised label ratio learning method |
CN111314691A (en) * | 2018-12-11 | 2020-06-19 | 中国移动通信集团广东有限公司 | Video call quality assessment method and device |
CN111314691B (en) * | 2018-12-11 | 2022-09-16 | 中国移动通信集团广东有限公司 | Video call quality assessment method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109189767B (en) | Data processing method and device, electronic equipment and storage medium | |
CN110188047B (en) | Double-channel convolutional neural network-based repeated defect report detection method | |
CN110209823A (en) | A kind of multi-tag file classification method and system | |
CN103617429A (en) | Sorting method and system for active learning | |
CN104966105A (en) | Robust machine error retrieving method and system | |
CN107943856A (en) | A kind of file classification method and system based on expansion marker samples | |
CN103699523A (en) | Product classification method and device | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
US20140198980A1 (en) | Image identification apparatus, image identification method, and non-transitory computer readable medium | |
CN105205501A (en) | Multi-classifier combined weak annotation image object detection method | |
CN101882136B (en) | Method for analyzing emotion tendentiousness of text | |
CN110414587A (en) | Depth convolutional neural networks training method and system based on progressive learning | |
CN106156805A (en) | A kind of classifier training method of sample label missing data | |
CN105894032A (en) | Method of extracting effective features based on sample properties | |
CN108877947A (en) | Depth sample learning method based on iteration mean cluster | |
CN110019779B (en) | Text classification method, model training method and device | |
CN104750875A (en) | Machine error data classification method and system | |
CN109933619A (en) | A kind of semisupervised classification prediction technique | |
CN102129565B (en) | Object detection method based on feature redundancy elimination AdaBoost classifier | |
CN114139634A (en) | Multi-label feature selection method based on paired label weights | |
CN104978569A (en) | Sparse representation based incremental face recognition method | |
CN111984790B (en) | Entity relation extraction method | |
CN103559510B (en) | Method for recognizing social group behaviors through related topic model | |
CN103955676B (en) | Human face identification method and system | |
CN107909090A (en) | Learn semi-supervised music-book on pianoforte difficulty recognition methods based on estimating |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160824 |