
CN105894032A - Method of extracting effective features based on sample properties - Google Patents


Info

Publication number
CN105894032A
CN105894032A (application CN201610202600.5A)
Authority
CN
China
Prior art keywords
sample
feature
features
extracted
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610202600.5A
Other languages
Chinese (zh)
Inventor
詹德川 (Zhan De-Chuan)
姜远 (Jiang Yuan)
周志华 (Zhou Zhi-Hua)
李静 (Li Jing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201610202600.5A
Publication of CN105894032A
Legal status: Pending (Current)

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method of extracting effective features based on sample properties, comprising a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step. At the start of classification, an initial feature set is extracted; for each sample to be classified, the feature set to extract next is determined from the features already extracted, and whether to stop extraction is then judged. If more features are needed, the previous step is repeated; if extraction stops, the sample is input into an appropriate classifier to obtain a prediction. Compared with the prior art, both the time cost of feature extraction and the confidence of classification are fully considered.

Description

Method of extracting effective features based on sample properties
Technical Field
The invention relates to a technique for extracting effective features from samples in pattern recognition, particularly suited to problems where feature-extraction cost and classification reliability must be considered at the same time.
Background
With the rapid development of the internet and of portable internet devices, the network has become an important part of people's lives and an important carrier for the spread and development of human civilization; more and more data propagates through it. To meet people's varied demands on the form of information, text, sound, images and other modalities are usually integrated, which makes the data formats on the network complex. Nowadays, ever more complex media data is produced and spread in large quantities over networks, and we face the problem of how to retrieve and classify such large-scale, complex data efficiently. It is therefore desirable to find an efficient and useful way of extracting features to process this vast amount of information.
Currently there are many online machine learning methods, such as online clustering and online classification; they all accelerate learning through sampling or optimization strategies. However, these methods do not take the feature-extraction overhead into account, that is, the cost of turning raw data into valid features. In fact, extracting valid features from raw data is a non-trivial cost in the operation of a whole classification system, and as data forms become more complex, feature extraction takes an ever larger share of the total system cost. How to extract useful features efficiently is therefore a problem that needs to be solved.
Medical diagnostic systems involve a series of tests, such as body-temperature measurement, blood tests, and blood-pressure measurement. A diagnosis is not made by first running every test, which would be too costly; instead, a preliminary examination is carried out, and based on its result it is decided whether a further examination is needed and, if so, which one; if not, a diagnostic conclusion is drawn. Inspired by this idea, we hope to extract, for each sample, the set of features that is most efficient for classifying that sample, rather than extracting all features, thereby reducing the feature-extraction overhead.
Summary of the Invention
Purpose of the invention: many existing machine learning algorithms improve learning efficiency through sampling or optimization, but few consider the overhead of extracting sample features, and that overhead grows as data forms become more complex. Addressing this problem, the invention provides a method of extracting effective features based on sample properties: for samples that are easy to classify, only simple features, i.e. features with small overhead, are extracted; for samples that are hard to classify, some complex features are extracted in addition to the simple ones to aid classification.
Technical scheme: in a method of extracting effective features based on sample properties, an initial feature set is extracted at the start; for each sample to be classified, the feature set to extract next is determined from the features already extracted, and then whether to stop extraction is judged. If more features are needed, the previous step is repeated; if extraction stops, the sample is input into an appropriate classifier to obtain a prediction. The method comprises a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step;
the training sample feature serialization comprises the following specific steps:
step 100, labeling training sample data and acquiring all features together with the time overhead of each feature;
step 101, computing Euclidean distances between pairs of training samples from the acquired features;
step 102, finding each training sample's neighbor set from the pairwise distances and the preset number of neighbors;
step 103, computing, within each training sample's neighbor set, the weight of every feature, i.e. how useful each group of features is for classifying the sample;
step 104, sorting the features by weight: the larger the weight, the greater the feature's contribution to classification and the earlier it should be extracted;
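The serialization steps above can be sketched as follows. The weight computation here is a deliberately simplified neighbor-based score standing in for optimization (1), and `serialize_features` is a hypothetical helper name, not something defined in the patent:

```python
import numpy as np

def serialize_features(X, y, k=5):
    """Order feature groups by estimated usefulness (steps 100-104).

    X : (n_samples, n_features) labeled training data
    y : (n_samples,) labels
    k : number of neighbors (step 102)
    Returns feature indices sorted by descending weight (step 104).
    """
    n = X.shape[0]
    # Step 101: Euclidean distances between all training-sample pairs.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Step 102: each sample's k nearest neighbors (position 0 is the
    # sample itself at distance 0, so it is skipped).
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]
    # Step 103 (simplified stand-in for optimization (1)): reward features
    # on which same-label neighbors are close and other-label ones are far.
    weights = np.zeros(X.shape[1])
    for i in range(n):
        for j in neighbors[i]:
            sign = 1.0 if y[i] == y[j] else -1.0
            weights -= sign * np.abs(X[i] - X[j])
    # Step 104: larger weight means the feature is extracted earlier.
    return np.argsort(-weights)
```

On a toy two-class set where feature 0 separates the classes and feature 1 is noise, feature 0 is ranked first.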
the specific steps of training the sample feature selector and the corresponding models are as follows:
step 200, after the training data are serialized, splitting the data into pairs of (existing feature set, feature set to extract next) to obtain a group of feature-set pairs;
step 201, from the split feature-set pairs, training a feature selector G based on the currently extracted features, and training classifiers for the different feature combinations;
the specific steps of the per-sample model classification are as follows:
step 300, extracting an initial feature set from the test sample;
step 301, judging from the evaluation index whether a further feature set needs to be extracted; if so, jumping to step 302, otherwise jumping to step 303;
step 302, determining the feature set to extract next from the existing feature set and the feature selector G, merging the newly extracted feature set into the existing feature set, and jumping back to step 301;
step 303, looking up the trained classification model corresponding to the current feature set and classifying with it.
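Steps 300 to 303 amount to an anytime extraction loop at test time. A minimal sketch, where `extract`, `selector_G`, `classifiers`, and `should_stop` are hypothetical callables standing in for the patent's components:

```python
def classify_adaptively(sample, initial_set, extract, selector_G,
                        classifiers, should_stop):
    """Steps 300-303: extract feature sets one at a time until the
    stopping criterion (time budget or confidence) is met, then classify.

    extract(sample, fs)    -> feature values for feature set fs
    selector_G(extracted)  -> next feature set to extract (step 302)
    classifiers[frozenset] -> model trained for that feature combination
    should_stop(extracted) -> True once the evaluation index is satisfied
    """
    # Step 300: extract the initial feature set.
    extracted = {initial_set: extract(sample, initial_set)}
    # Step 301: decide whether more features are needed.
    while not should_stop(extracted):
        # Step 302: pick the next feature set given the current ones.
        next_fs = selector_G(extracted)
        extracted[next_fs] = extract(sample, next_fs)
    # Step 303: classify with the model matching the extracted combination.
    model = classifiers[frozenset(extracted)]
    return model(extracted)
```

With stub components (stop after two sets, a sign classifier on the feature sum), the loop extracts 'f1' then 'f2' and classifies.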
The neighbor set of a training sample in step 102 is found as follows: sort the computed Euclidean distances in ascending order and select the first k, where k is the preset number of neighbors.
The weight of a training sample's features in step 103 is computed by minimizing, over the sample's neighbors, a logistic loss on the weighted distances plus an L1 penalty:

arg min_{u_i} Σ_{j ∈ δ_i} log(1 + exp(r_{ij} (D_i(x_j) − c_i))) + λ‖u_i‖_1   s.t.  u_i ≥ 0   (1)

where x_i and x_j denote the i-th and j-th samples, D_i(x_j) is the weighted distance between x_i and x_j under the weight vector u_i, u_i holds the feature weights of the i-th sample, and δ_i is the set formed by the k neighbors of the i-th sample; y_i and y_j denote the labels of the i-th and j-th samples, with r_{ij} = 1 if y_i = y_j and r_{ij} = −1 otherwise; c_i and λ are preset parameters: c_i is the upper bound on the within-class sample distance and λ is a regularization parameter.
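As a sketch, objective (1) can be evaluated numerically for a given weight vector. The weighted distance D_i(x_j) is assumed here to be a u_i-weighted L1 distance, a choice the patent does not pin down, and the function name is illustrative:

```python
import numpy as np

def weight_objective(u, xi, neighbors, r, c_i, lam):
    """Objective (1) for one training sample x_i.

    u         : nonnegative feature weights u_i
    xi        : feature vector of sample i
    neighbors : list of neighbor feature vectors x_j, j in delta_i
    r         : r_ij in {+1, -1} per neighbor (same label -> +1)
    c_i       : upper bound on within-class distance
    lam       : L1 regularization strength lambda
    """
    loss = 0.0
    for xj, rij in zip(neighbors, r):
        d = np.dot(u, np.abs(xi - xj))   # assumed weighted distance D_i(x_j)
        loss += np.log1p(np.exp(rij * (d - c_i)))
    return loss + lam * np.abs(u).sum()
```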
The feature selector G in step 201 selects the feature set to extract next by maximizing a linear score over the candidate sets:

G(x^l) = arg max_c w^T f(x^l, c)

where x^l denotes the features already extracted in the first l rounds, c denotes a candidate feature set for the next round (ĉ denoting the candidates not selected), f is a function on the features, and w is a linear coefficient vector;
the function f on the features is expressed as:

f(x^l, c) = x^l 1^T C   (4)

where 1^T is a 1 × m vector of all ones, m being the number of feature groups to extract; C is a diagonal matrix whose k-th diagonal element C_kk equals 1 when c = k and −1 otherwise.
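A small sketch of equation (4): the map replicates x^l into m columns and flips the sign of every column except the one for the chosen candidate c (the function name is illustrative):

```python
import numpy as np

def feature_map(x_l, c, m):
    """Equation (4): f(x^l, c) = x^l 1^T C.

    Tiles x^l into m columns, then scales column k by C_kk,
    where C is diagonal with C_kk = 1 iff k == c, else -1.
    """
    F = np.tile(x_l[:, None], (1, m))   # x^l 1^T : shape (d, m)
    signs = -np.ones(m)                 # diagonal of C
    signs[c] = 1.0
    return F * signs                    # multiply column k by C_kk
```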
The linear coefficient w is obtained by solving:

arg min_w ‖w‖_2^2 + α Σ_{i,l} ξ_i^l   s.t.   w^T f(x_i^l, c_{l+1}) > Δ(c_{l+1}, ĉ_{l+1}) + w^T f(x_i^l, ĉ_{l+1}) − ξ_i^l   (5)

where x_i^l indicates that the i-th sample has had l groups of features extracted, c_{l+1} is the feature set the i-th sample should extract at step l + 1, ĉ_{l+1} ranges over the other candidate feature sets at step l + 1, Δ is defined by Δ(c_i, c_i) = 0 and Δ(c_i, c_j) = 1 for i ≠ j, ξ_i^l is a slack variable, and α is a regularization parameter.
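Problem (5) is a margin-based program in the structured-SVM family. A plain projected-subgradient sketch under the simplifying assumption of one competing candidate per sample; the step size, iteration count, and helper name are arbitrary choices, not from the patent:

```python
import numpy as np

def train_selector_w(F_pos, F_neg, alpha=1.0, lr=0.01, epochs=100):
    """Subgradient sketch for problem (5).

    F_pos : (n, d) rows f(x_i^l, c_{l+1}) for the correct next set
    F_neg : (n, d) rows f(x_i^l, c-hat_{l+1}) for a competing candidate
    Encourages w^T F_pos[i] > 1 + w^T F_neg[i] - xi_i  (Delta = 1),
    with L2 regularization on w.
    """
    n, d = F_pos.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = (F_pos - F_neg) @ w
        viol = margins < 1.0                 # constraints still violated
        # Subgradient of ||w||^2 + alpha * sum of hinge slacks.
        grad = 2 * w + alpha * (F_neg[viol] - F_pos[viol]).sum(axis=0)
        w -= lr * grad
    return w
```

After training on separable pairs, the correct candidate scores above the competitor for every sample.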
The classifier C_s in step 201 is given by:

C_s(x^s) = arg max_{y ∈ Z} V^T f(x^s, y)   (6)

where x^s denotes the extracted features, y denotes a label of the sample, Z denotes the label space, i.e. the set of all labels, and f is a function on the features; V is solved from the following optimization:

arg min_V ‖V‖_2^2 + D Σ_i ε_i   s.t.   V^T f(x_i^s, y_i) > Δ(y_i, ŷ) + V^T f(x_i^s, ŷ) − ε_i   (7)

where x_i^s denotes the extracted features of the i-th sample, y_i denotes the label of the i-th sample, ŷ denotes any label other than y_i, Δ is defined by Δ(y_i, y_i) = 0 and Δ(y_i, ŷ) = 1 otherwise, ε_i is a slack variable, and D is a regularization parameter.
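Decision rule (6) scores every candidate label with the same linear form and returns the arg max. A minimal sketch; the feature map `f` is passed in, and all shapes are illustrative:

```python
import numpy as np

def classify(x_s, V, labels, f):
    """Equation (6): C_s(x^s) = argmax over y in Z of V^T f(x^s, y)."""
    scores = {y: float(V @ f(x_s, y)) for y in labels}
    return max(scores, key=scores.get)
```

With a block-structured f that places x^s in the slot of label y, the label whose block aligns with V wins.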
The evaluation index of step 301 includes a time-budget threshold for feature extraction and a classification-accuracy requirement of the classifier.
Advantageous effects: compared with the prior art, the method fully accounts for both the time overhead of sample feature extraction and the confidence of classification. Exploiting the properties of each sample, it extracts only the features most useful for classifying that kind of sample: for simple samples, only some basic features are extracted; for complex samples, more features are extracted. Because different feature sets contribute differently even for the same sample, the invention supplies the features most beneficial to classification, which helps improve classification accuracy.
Drawings
FIG. 1 is a flow chart of the operation of the training sample feature serialization phase of the present invention;
FIG. 2 is a flowchart illustrating the operation of the sample feature selector and corresponding model training phase of the present invention;
FIG. 3 is a workflow diagram of the model classification phase for a sample of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and do not limit the scope of the invention; various equivalent modifications that occur to those skilled in the art upon reading this disclosure fall within the scope of the appended claims.
The workflow of the training sample feature serialization phase is shown in fig. 1. This phase requires a certain amount of training data that is labeled and has all features available. In actual use, a company can label a batch of data and obtain all its features together with the time overhead of each feature (step 10); then compute the Euclidean distances between the training samples from these features (step 11); select the preset number of neighbors for each sample (step 12); next, compute the weight of each feature of each training sample (step 13); finally, sort the features by weight, larger weights first. This yields the serialized training samples.
The workflow of the sample feature selector and model training phase is shown in fig. 2. First, the training samples serialized in the previous phase are split to obtain pairs of (existing feature set, feature set to extract next) (step 15); then, from these feature-set pairs, a feature selector G is trained. Meanwhile, a classification model C_s is trained for each feature combination of the training samples (step 16).
The per-sample model classification workflow is shown in fig. 3. First, an initial feature set is extracted from the test sample (step 18); then, it is judged whether the existing features satisfy the stopping criterion, which may be a time budget for feature extraction or an accuracy attainable by the classifier (chosen according to the actual requirements) (step 19); if the stopping criterion is met, the matching model is selected directly for classification to obtain the result (step 20a); otherwise, the next feature to extract is chosen via the feature selector and the process returns to step 19 (step 20b).

Claims (6)

1. A method of extracting effective features based on sample properties, comprising: a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step;
the training sample feature serialization comprises the following specific steps:
step 100, labeling training sample data and acquiring all features together with the time overhead of each feature;
step 101, computing Euclidean distances between pairs of training samples from the acquired features;
step 102, finding each training sample's neighbor set from the pairwise distances and the preset number of neighbors;
step 103, computing, within each training sample's neighbor set, the weight of every feature, i.e. how useful each group of features is for classifying the sample;
step 104, sorting the features by weight: the larger the weight, the greater the feature's contribution to classification and the earlier it should be extracted;
the specific steps of training the sample feature selector and the corresponding models are as follows:
step 200, after the training data are serialized, splitting the data into pairs of (existing feature set, feature set to extract next) to obtain a group of feature-set pairs;
step 201, from the split feature-set pairs, training a feature selector G based on the currently extracted features, and training classifiers for the different feature combinations;
the specific steps of the per-sample model classification are as follows:
step 300, extracting an initial feature set from the test sample;
step 301, judging from the evaluation index whether a further feature set needs to be extracted; if so, jumping to step 302, otherwise jumping to step 303;
step 302, determining the feature set to extract next from the existing feature set and the feature selector G, merging the newly extracted feature set into the existing feature set, and jumping back to step 301;
step 303, looking up the trained classification model corresponding to the current feature set and classifying with it.
2. The method of extracting effective features based on sample properties of claim 1, wherein the neighbor set of a training sample in step 102 is found as follows: sort the computed Euclidean distances in ascending order and select the first k, where k is the preset number of neighbors.
3. The method of extracting effective features based on sample properties of claim 1, wherein the weight of a training sample's features in step 103 is computed by minimizing, over the sample's neighbors, a logistic loss on the weighted distances plus an L1 penalty:

arg min_{u_i} Σ_{j ∈ δ_i} log(1 + exp(r_{ij} (D_i(x_j) − c_i))) + λ‖u_i‖_1   s.t.  u_i ≥ 0   (1)

where x_i and x_j denote the i-th and j-th samples, D_i(x_j) is the weighted distance between x_i and x_j under the weight vector u_i, u_i holds the feature weights of the i-th sample, and δ_i is the set formed by the k neighbors of the i-th sample; y_i and y_j denote the labels of the i-th and j-th samples, with r_{ij} = 1 if y_i = y_j and r_{ij} = −1 otherwise; c_i and λ are preset parameters: c_i is the upper bound on the within-class sample distance and λ is a regularization parameter.
4. The method of extracting effective features based on sample properties of claim 1, wherein the feature selector G in step 201 selects the feature set to extract next by maximizing a linear score over the candidate sets:

G(x^l) = arg max_c w^T f(x^l, c)

where x^l denotes the features already extracted in the first l rounds, c denotes a candidate feature set for the next round (ĉ denoting the candidates not selected), f is a function on the features, and w is a linear coefficient vector;
the function f of the features is expressed as:
f(xl,c)=xl1TC (4)
1Tis a vector with the size of 1 × m and all elements of 1, wherein m is the number of groups for extracting features; c denotes a diagonal matrix, CkkDenotes the element on the k-th main diagonal line, when C is k, Ckk1, otherwise Ckk=-1。
The linear coefficient w is obtained by solving:

arg min_w ‖w‖_2^2 + α Σ_{i,l} ξ_i^l   s.t.   w^T f(x_i^l, c_{l+1}) > Δ(c_{l+1}, ĉ_{l+1}) + w^T f(x_i^l, ĉ_{l+1}) − ξ_i^l   (5)

where x_i^l indicates that the i-th sample has had l groups of features extracted, c_{l+1} is the feature set the i-th sample should extract at step l + 1, ĉ_{l+1} ranges over the other candidate feature sets at step l + 1, Δ is defined by Δ(c_i, c_i) = 0 and Δ(c_i, c_j) = 1 for i ≠ j, ξ_i^l is a slack variable, and α is a regularization parameter.
5. The method of extracting effective features based on sample properties of claim 1, wherein the classifier C_s in step 201 is given by:

C_s(x^s) = arg max_{y ∈ Z} V^T f(x^s, y)   (6)

where x^s denotes the extracted features, y denotes a label of the sample, Z denotes the label space, i.e. the set of all labels, and f is a function on the features; V is solved from the following optimization:

arg min_V ‖V‖_2^2 + D Σ_i ε_i   s.t.   V^T f(x_i^s, y_i) > Δ(y_i, ŷ) + V^T f(x_i^s, ŷ) − ε_i   (7)

where x_i^s denotes the extracted features of the i-th sample, y_i denotes the label of the i-th sample, ŷ denotes any label other than y_i, Δ is defined by Δ(y_i, y_i) = 0 and Δ(y_i, ŷ) = 1 otherwise, ε_i is a slack variable, and D is a regularization parameter.
6. The method of extracting effective features based on sample properties of claim 1, wherein the evaluation index of step 301 includes a time-budget threshold for feature extraction and a classification-accuracy requirement of the classifier.
CN201610202600.5A 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties Pending CN105894032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610202600.5A CN105894032A (en) 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610202600.5A CN105894032A (en) 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties

Publications (1)

Publication Number Publication Date
CN105894032A true CN105894032A (en) 2016-08-24

Family

ID=57012118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610202600.5A Pending CN105894032A (en) 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties

Country Status (1)

Country Link
CN (1) CN105894032A (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095902A (en) * 2014-05-23 2015-11-25 华为技术有限公司 Method and apparatus for extracting image features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI-PING LIU et al.: "TEFE: A Time-Efficient Approach to Feature Extraction", in Proceedings of the 8th IEEE International Conference on Data Mining *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845537A (en) * 2017-01-09 2017-06-13 北京邮电大学 Classifier radius determination method and device based on adaptive threshold
CN106845537B (en) * 2017-01-09 2020-12-04 北京邮电大学 Classifier radius determination method and device based on self-adaptive threshold
CN107358209A (en) * 2017-07-17 2017-11-17 成都通甲优博科技有限责任公司 Training method and device for a face detection model, and face detection method and device
CN107545274A (en) * 2017-07-18 2018-01-05 北京建筑大学 Semi-supervised label ratio learning method
CN111314691A (en) * 2018-12-11 2020-06-19 中国移动通信集团广东有限公司 Video call quality assessment method and device
CN111314691B (en) * 2018-12-11 2022-09-16 中国移动通信集团广东有限公司 Video call quality assessment method and device


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160824