CN107016073B - A kind of text classification feature selection approach - Google Patents
A kind of text classification feature selection approach
- Publication number
- CN107016073B (application CN201710181572.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- degree
- sel
- exc
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a text classification feature selection method that can reduce feature dimensionality and classification complexity while improving classification accuracy. The method includes: obtaining a feature set S and a target class C, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sorting the feature set S in descending order of R_c(x^(i)); calculating the redundancy R_x and the degree of synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature by combining them with the degree of association R_c(x^(i)) between the feature and the target class, comparing the sensitivity with a preset threshold th and, using the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th; calculating the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th. The present invention is applicable to the field of machine learning text classification.
Description
Technical Field
The invention relates to the field of machine learning text classification, in particular to a text classification feature selection method.
Background
With the continuous expansion of the Internet, the information resources gathered on it keep growing. Content-based information retrieval and data mining have therefore attracted attention as means of managing and exploiting these information resources effectively. Text classification is an important basis for information retrieval and text data mining; its main task is to assign documents of unknown class to one or more predefined classes according to their content. However, the large number of training samples and the high dimensionality of the feature vectors make text classification a machine learning problem with high time and space complexity. Feature selection is therefore needed to reduce the feature dimensionality while preserving classification performance as much as possible.
Feature selection is an important data preprocessing step. Among common text classification feature selection methods, the Chi-Square test establishes a null hypothesis that a word is unrelated to the target classes and selects as features the words that deviate most from this hypothesis. It only counts whether a word appears in a document, regardless of how often, which biases it towards low-frequency words. Mutual Information methods select features by the amount of information that the presence of a word brings to the target class, but they only consider the degree of association between words and target classes and neglect possible dependencies between the words themselves. The TF-IDF (Term Frequency-Inverse Document Frequency) method evaluates the importance of a word by jointly considering its frequency within a document and its distribution across all documents, and selects features accordingly; however, it simply assumes that words with low document frequency are more important and words with high document frequency are less useful, so its accuracy is limited. Other feature selection methods, such as information gain, odds ratio, weight of text evidence and expected cross entropy, mostly consider only the degree of association between words and the target classes, or only the correlation between words, which easily leads to insufficient dimensionality reduction or low classification precision.
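To make the document-frequency weighting described above concrete, the following Python sketch computes TF-IDF weights for a toy corpus; the corpus, tokenization and exact normalization are illustrative assumptions, not part of the prior-art methods being summarized.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Minimal TF-IDF: term frequency within a document times the
    inverse document frequency of the term across all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count documents containing each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights

docs = [["cheap", "offer", "buy"], ["meeting", "agenda", "offer"]]
print(tf_idf(docs))  # terms appearing in fewer documents receive larger weights
```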
Disclosure of Invention
The invention aims to provide a text classification feature selection method to solve the problems of high feature dimension or low classification precision in the prior art.
In order to solve the above technical problem, an embodiment of the present invention provides a text classification feature selection method, including:
step 1: acquiring a feature set S and a target class C, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sorting the feature set S in descending order of the degree of association R_c(x^(i));
step 2: calculating the redundancy R_x and the degree of synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature by combining them with the degree of association R_c(x^(i)) between the feature and the target class, comparing the sensitivity Sen with a preset threshold th and, using the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
step 3: calculating the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th.
Further, the step 1 comprises:
step 11: for each feature x^(i) in the feature set S, calculating the degree of association R_c(x^(i)) between the feature x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between the feature x^(i) and the target class C;
step 12: sorting the feature set S in descending order of the degree of association R_c(x^(i));
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the degree of association between the feature x^(i) and the target class C.
Further, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log( p(x^(i) | c_k) / p(x^(i)) )
where c_k denotes the k-th class of the target class C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur together, p(x^(i) | c_k) denotes the probability that the feature x^(i) occurs in class c_k, and p(x^(i)) denotes the probability that the feature x^(i) occurs in the feature set S.
Further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain.
Further, the degree of synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the degree of synergy between the feature x^(i) and the feature x^(j), and S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain.
Further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) denotes the mutual information between the feature x^(i) and the target class C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target class C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target class C.
Further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log( p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) )
where c_k denotes the k-th class of the target class C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur together in class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur together in the feature set S.
Further, the step 2 comprises:
step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature is the one with the largest degree of association R_c(x^(i));
step 22: starting from the second feature in the feature set S, with x^(i) denoting the current feature, calculating the redundancy R_x and the degree of synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and, combining them with the degree of association R_c(x^(i)) between the feature and the target class, calculating the sensitivity Sen(x^(i)) of the feature x^(i);
step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise, adding the feature x^(i) to the exclusion set S_exc;
step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the degree of synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum of the redundancies between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum of the degrees of synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target class C, and R_c(x^(i)) denotes the degree of association between the feature x^(i) and the target class C.
Further, the step 3 comprises:
step 31: setting the pending set S_tbd to the empty set, i.e. S_tbd = {}; letting x^(k) be the first feature of the exclusion set S_exc and x^(m) be the first feature of the candidate set S_sel;
step 32: for the feature x^(k) of the exclusion set S_exc, calculating the maximum of the degrees of synergy between the feature x^(m) of the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
step 33: if the maximum degree of synergy of the feature x^(m) is attained with x^(k), adding x^(m) to the pending set S_tbd;
step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature of the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature of the candidate set S_sel and returning to step 32;
step 35: for the feature x^(j) of the pending set S_tbd, updating the sensitivity of the feature x^(j) according to the following formula:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element of the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element of the pending set S_tbd and returning to step 35;
step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element of the exclusion set S_exc and returning to step 31.
The technical scheme of the invention has the following beneficial effects:
in this scheme, the degree of association R_c(x^(i)) between the features and the target classes, the redundancy R_x between features and the degree of synergy S_x between features are calculated from the feature set S and the target class C, and from these the sensitivity Sen of each feature is calculated; the features are screened according to a preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the candidate set and the exclusion set are then continuously adjusted and optimized. In this way, the interrelations between features and target classes and among the features themselves are considered comprehensively, features are selected through the degree of association, the redundancy and the degree of synergy, the features that play a key role in classification are retained, the feature dimensionality and the classification complexity are reduced, and the classification accuracy can be improved.
Drawings
Fig. 1 is a schematic flowchart of a text classification feature selection method according to an embodiment of the present invention;
fig. 2 is a detailed flowchart of a text classification feature selection method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart illustrating a feature selection method for partitioning a candidate set and an exclusion set according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of adjusting a candidate set and an exclusion set by a feature selection method according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a text classification feature selection method aiming at the problems of high feature dimension or low classification precision in the prior art.
As shown in fig. 1, the method for selecting text classification features provided in the embodiment of the present invention includes:
step 1: acquiring a feature set S and a target class C, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sorting the feature set S in descending order of the degree of association R_c(x^(i));
step 2: calculating the redundancy R_x and the degree of synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature by combining them with the degree of association R_c(x^(i)) between the feature and the target class, comparing the sensitivity Sen with a preset threshold th and, using the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
step 3: calculating the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th.
The text classification feature selection method provided by the embodiment of the invention calculates, from the feature set S and the target class C, the degree of association R_c(x^(i)) between the features and the target class, the redundancy R_x between features and the degree of synergy S_x between features, and from these calculates the sensitivity Sen of each feature; the features are screened according to a preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the candidate set and the exclusion set are continuously adjusted and optimized in the subsequent steps. In this way, the interrelations between features and target classes and among the features themselves are considered comprehensively, features are selected through the degree of association, the redundancy and the degree of synergy, the features that play a key role in classification are retained, the feature dimensionality and the classification complexity are reduced, and the classification accuracy can be improved.
In this embodiment, as shown in fig. 2, to acquire the feature set S and the target class C, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target class C need to be input first.
In this embodiment, the feature set S represents the set of all features in the text classification process (a single feature, i.e. a word vector, is denoted by x^(i)), that is S = (x^(1), x^(2), ..., x^(n)), where n denotes the number of features in the feature set S; the feature x^(i) is the column vector formed by the number of occurrences, in each text file, of the word corresponding to the feature; the target class C represents the column vector formed by the category corresponding to each text file, the categories being drawn from a category set.
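As an illustration of these inputs, the sketch below builds the count-based feature vectors x^(i) (the columns of X) and the class vector C from a small corpus; the toy documents and labels are assumptions used only for illustration.

```python
import numpy as np

# Toy corpus: each document is a list of words, with one class label per document.
docs = [["cheap", "offer", "buy", "offer"],
        ["meeting", "agenda", "minutes"],
        ["cheap", "buy", "deal"]]
labels = ["spam", "work", "spam"]

vocab = sorted({w for d in docs for w in d})        # the features x^(1..n)
word_index = {w: i for i, w in enumerate(vocab)}

# X[:, i] is the column vector x^(i): occurrence counts of word i in each file.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for row, doc in enumerate(docs):
    for w in doc:
        X[row, word_index[w]] += 1

C = np.array(labels)                                # target class column vector
print(vocab)
print(X)
print(C)
```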
In this embodiment, the degree of association R_c(x^(i)) between the feature x^(i) and the target class C is the mutual information between the feature x^(i) and the target class C.
In this embodiment, as an optional embodiment, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C and sorting the feature set S in descending order of R_c(x^(i)) (step 1) comprises:
step 11: for each feature x^(i) in the feature set S, calculating the degree of association R_c(x^(i)) between the feature x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between the feature x^(i) and the target class C;
step 12: sorting the feature set S in descending order of the degree of association R_c(x^(i));
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the degree of association between the feature x^(i) and the target class C.
In this embodiment, the mutual information is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log( p(x^(i) | c_k) / p(x^(i)) )
where I(x^(i); C) denotes the mutual information between the feature x^(i) and the target class C, c_k denotes the k-th class of the target class C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur together, p(x^(i) | c_k) denotes the probability that the feature x^(i) occurs in class c_k, and p(x^(i)) denotes the probability that the feature x^(i) occurs in the feature set S.
In this embodiment, preferably, the probability p(x^(i), c_k) that the feature x^(i) and the class c_k occur together is approximated by the frequency with which the word corresponding to x^(i) occurs in files of class c_k relative to all files, i.e.:
p(x^(i), c_k) ≈ Σ_m x^(i)_{k,m} / Σ_j x^(i)_j
where x^(i)_j denotes the j-th element of the feature x^(i) (i.e. the number of times the word corresponding to the feature x^(i) appears in the j-th file), and x^(i)_{k,m} denotes the m-th element of x^(i) whose corresponding target class is c_k (i.e. the number of times the word corresponding to x^(i) appears in the m-th file of class c_k).
In this embodiment, preferably, the probability p(x^(i) | c_k) that the feature x^(i) occurs in class c_k is approximated by the frequency with which the word corresponding to x^(i) occurs in the files of class c_k.
In this embodiment, preferably, the probability p(x^(i)) that the feature x^(i) occurs in the feature set S is approximated by the frequency with which the word corresponding to x^(i) occurs in all files.
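To make the frequency-based estimates concrete, the following sketch estimates p(x^(i), c_k), p(x^(i) | c_k) and p(x^(i)) from the count matrix X and class vector C of the previous sketch and combines them into the mutual-information relevance R_c(x^(i)); the exact normalizations are assumptions consistent with the frequency approximations described here, not a verbatim transcription of the patent's formulas.

```python
import numpy as np

def relevance(X, C, i):
    """Estimate R_c(x^(i)) = I(x^(i); C) from word counts.

    X : (n_docs, n_features) count matrix; C : array of class labels.
    Probabilities are approximated by word-occurrence frequencies
    (assumed normalizations)."""
    total = X.sum()                       # all word occurrences in all files
    col = X[:, i]
    p_x = col.sum() / total               # p(x^(i)): frequency over all files
    mi = 0.0
    for c in np.unique(C):
        in_class = (C == c)
        p_xc = col[in_class].sum() / total                    # joint frequency
        p_x_given_c = col[in_class].sum() / max(X[in_class].sum(), 1)
        if p_xc > 0 and p_x_given_c > 0:
            mi += p_xc * np.log(p_x_given_c / p_x)
    return mi

# print(relevance(X, C, word_index["cheap"]))  # using X, C from the earlier sketch
```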
in this embodiment, as a further alternative, as shown in fig. 3, the redundancy R between every two features in the feature set S is calculatedxAnd degree of synergy SxAssociation degree R between combined features and object classesc(x(i)) Calculating the sensitivity Sen of the feature, comparing the sensitivity Sen with a preset threshold th, and dividing the feature set S into candidate sets S according to the threshold thselAnd exclusion set Sexc(step 2) comprising:
step 21: adding the first feature in the feature set S to the candidate set SselWill exclude the set SexcSet to an empty set, i.e. Ssel={x(1)},Sexc-the first feature corresponds to a degree of association R { }c(x(i)) Maximum;
step 22: starting with the second feature in the feature set S, with x(i)Representing said second feature, calculating feature x(i)And candidate set SselRedundancy R between all features inxAnd degree of synergy SxAnd combining the relevance R between the characteristics and the target categoryc(x(i)) Computing feature x(i)Sensitivity Sen (x) of (2)(i));
Step 23: sensitivity Sen (x)(i)) Comparing with a preset threshold th, if Sen (x)(i)) > th, then the feature x(i)Add candidate set Ssel(ii) a Else, the feature x(i)Addition of exclusion set Sexc;
Step 24: if x(i)If the last feature in the feature set S is adopted, the division is ended; otherwise, x is(i)Is arranged asThe next feature in the set S is returned to step 22.
In an embodiment of the foregoing text classification feature selection method, further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain.
In an embodiment of the foregoing text classification feature selection method, further, the degree of synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the degree of synergy between the feature x^(i) and the feature x^(j), and S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain.
In an embodiment of the foregoing text classification feature selection method, further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) and I(x^(j); C) are computed with the same formula as the mutual information between the feature x^(i) and the target class C above; I(x^(i); C) denotes the mutual information between the feature x^(i) and the target class C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target class C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target class C.
In an embodiment of the foregoing text classification feature selection method, further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log( p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) )
where c_k denotes the k-th class of the target class C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur together in class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur together in the feature set S.
In this embodiment, preferably, the probability p(x^(i), x^(j), c_k) that the feature x^(i), the feature x^(j) and the class c_k occur together is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur together in the files of class c_k relative to all files, where the number of simultaneous occurrences in the m-th file of class c_k is taken as the smaller of the occurrence counts of the two words in that file.
In this embodiment, preferably, the probability p((x^(i), x^(j)) | c_k) that the feature x^(i) and the feature x^(j) occur together in class c_k is approximated by the frequency with which the corresponding words occur together in the files of class c_k.
In this embodiment, preferably, the probability p(x^(i), x^(j)) that the feature x^(i) and the feature x^(j) occur together in the feature set S is approximated by the frequency with which the corresponding words occur together in all files.
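The sketch below turns these definitions into code, building on the relevance() helper above: the co-occurrence count of two words in a file is taken as the smaller of their two counts, the joint mutual information I((x^(i), x^(j)); C) is estimated from those counts, and the correlation gain, redundancy R_x and degree of synergy S_x follow directly; the normalizations are assumptions in line with the single-feature case.

```python
import numpy as np

def joint_relevance(X, C, i, j):
    """Estimate I((x^(i), x^(j)); C), using min(count_i, count_j) per file
    as the co-occurrence count (assumed normalizations)."""
    co = np.minimum(X[:, i], X[:, j])     # simultaneous occurrences per file
    total = X.sum()
    p_xy = co.sum() / total
    mi = 0.0
    for c in np.unique(C):
        in_class = (C == c)
        p_xyc = co[in_class].sum() / total
        p_xy_given_c = co[in_class].sum() / max(X[in_class].sum(), 1)
        if p_xyc > 0 and p_xy_given_c > 0:
            mi += p_xyc * np.log(p_xy_given_c / p_xy)
    return mi

def correlation_gain(X, C, i, j):
    """IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)."""
    return joint_relevance(X, C, i, j) - relevance(X, C, i) - relevance(X, C, j)

def redundancy(X, C, i, j):
    """R_x: the smaller of 0 and the correlation gain."""
    return min(0.0, correlation_gain(X, C, i, j))

def synergy(X, C, i, j):
    """S_x: the larger of 0 and the correlation gain."""
    return max(0.0, correlation_gain(X, C, i, j))
```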
In an embodiment of the foregoing text classification feature selection method, further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the degree of synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum of the redundancies between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum of the degrees of synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target class C, and R_c(x^(i)) denotes the degree of association between the feature x^(i) and the target class C.
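A minimal sketch of the sensitivity computation and the step 21-24 partition follows, reusing the relevance and correlation-gain helpers above; comparing each new feature only against the current candidate set when computing Sen mirrors step 22, while the parameter names alpha, beta and th are assumptions.

```python
import numpy as np

def sensitivity(X, C, i, others, alpha=0.5, beta=0.5):
    """Sen(x^(i)) = R_c + alpha*min(R_x) + beta*max(S_x) over the given features."""
    gains = [correlation_gain(X, C, i, j) for j in others if j != i]
    r_min = min((min(0.0, g) for g in gains), default=0.0)   # most redundant partner
    s_max = max((max(0.0, g) for g in gains), default=0.0)   # most synergistic partner
    return relevance(X, C, i) + alpha * r_min + beta * s_max

def partition(X, C, alpha=0.5, beta=0.5, th=0.01):
    """Steps 21-24: split features into a candidate set and an exclusion set,
    scanning features in descending order of relevance R_c."""
    order = sorted(range(X.shape[1]), key=lambda i: relevance(X, C, i), reverse=True)
    s_sel, s_exc = [order[0]], []         # most relevant feature seeds the candidates
    for i in order[1:]:
        if sensitivity(X, C, i, s_sel, alpha, beta) > th:
            s_sel.append(i)
        else:
            s_exc.append(i)
    return s_sel, s_exc

# s_sel, s_exc = partition(X, C)  # using X, C from the earlier sketches
```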
In this embodiment, as shown in fig. 4, as an optional embodiment, calculating the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th (step 3) comprises:
step 31: setting the pending set S_tbd to the empty set, i.e. S_tbd = {}; letting x^(k) be the first feature of the exclusion set S_exc and x^(m) be the first feature of the candidate set S_sel;
step 32: for the feature x^(k) of the exclusion set S_exc, calculating the maximum of the degrees of synergy between the feature x^(m) of the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
step 33: if the maximum degree of synergy of the feature x^(m) is attained with x^(k), adding x^(m) to the pending set S_tbd;
step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature of the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature of the candidate set S_sel and returning to step 32;
step 35: for the feature x^(j) of the pending set S_tbd, updating the sensitivity of the feature x^(j) according to the following formula:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element of the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element of the pending set S_tbd and returning to step 35;
step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element of the exclusion set S_exc and returning to step 31.
In this embodiment, according to steps 31-36, the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc is calculated and compared with the preset threshold th, and the candidate set S_sel and the exclusion set S_exc are adjusted according to the threshold th to obtain a new candidate set S_sel and exclusion set S_exc, so that the effect of the removal or addition of individual features on the classification result can be reduced.
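The adjustment of steps 31-36 can be sketched as follows, building on the helpers above; the reading of step 33 (a candidate enters the pending set when its strongest synergy partner is the excluded feature) and of the re-admission test in step 35 follows the reconstruction given in this description and should be taken as an assumption rather than a verbatim transcription of the claims.

```python
import numpy as np

def adjust(X, C, s_sel, s_exc, alpha=0.5, beta=0.5, th=0.01):
    """Steps 31-36: re-admit an excluded feature x^(k) when some candidate x^(j)
    relies on it for synergy and would fall below th without it."""
    n = X.shape[1]
    for k in list(s_exc):
        # Steps 32-33: candidates whose maximal synergy over all other features
        # is attained with the excluded feature k go into the pending set.
        s_tbd = []
        for m in s_sel:
            partners = [i for i in range(n) if i != m]
            best = max(partners, key=lambda i: synergy(X, C, m, i))
            if best == k:
                s_tbd.append(m)
        # Step 35: recompute the sensitivity of each pending feature without k.
        for j in s_tbd:
            others = [i for i in range(n) if i != j and i != k]
            if sensitivity(X, C, j, others, alpha, beta) < th:
                s_exc.remove(k)           # bring x^(k) back into the candidates
                s_sel.append(k)
                break
    return s_sel, s_exc

# s_sel, s_exc = adjust(X, C, s_sel, s_exc)  # using results of the earlier sketches
```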
In this embodiment, the weight α of the redundancy R_x may default to 0.5, the weight β of the degree of synergy S_x may default to 0.5, and the preset threshold th may default to 0.01; the weight α of the redundancy R_x, the weight β of the degree of synergy S_x and the preset threshold th can then be optimized and updated by a genetic algorithm during the subsequent training and testing processes.
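Putting the pieces together, a full run of the sketched pipeline with the default parameters mentioned here could look like the following; the function names come from the earlier sketches and are assumptions, not an API defined by the patent.

```python
# X: (n_docs, n_features) word-count matrix, C: per-document class labels,
# vocab: word list (see the first sketch for how these are built from documents).
s_sel, s_exc = partition(X, C, alpha=0.5, beta=0.5, th=0.01)             # steps 1-2
s_sel, s_exc = adjust(X, C, s_sel, s_exc, alpha=0.5, beta=0.5, th=0.01)  # step 3
selected_words = [vocab[i] for i in s_sel]
print("selected features:", selected_words)
```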
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (7)
1. A text classification feature selection method, comprising:
step 1: acquiring a feature set S and a target class C, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sorting the feature set S in descending order of the degree of association R_c(x^(i));
step 2: calculating the redundancy R_x and the degree of synergy S_x between every two features in the feature set S, calculating the sensitivity Sen of each feature by combining them with the degree of association R_c(x^(i)) between the feature and the target class, comparing the sensitivity Sen with a preset threshold th and, using the descending ordering of the feature set S, dividing the feature set S into a candidate set S_sel and an exclusion set S_exc according to the threshold th;
step 3: calculating the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting the candidate set S_sel and the exclusion set S_exc according to the threshold th;
wherein the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, R_x(x^(i); x^(j)) denotes the redundancy between the feature x^(i) and the feature x^(j), and R_x(x^(i); x^(j)) is the smaller of 0 and the correlation gain;
wherein the degree of synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, S_x(x^(i); x^(j)) denotes the degree of synergy between the feature x^(i) and the feature x^(j), and S_x(x^(i); x^(j)) is the larger of 0 and the correlation gain;
wherein the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the degree of synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum of the redundancies between the feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum of the degrees of synergy between the feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of the feature x^(i) to the target class C, and R_c(x^(i)) denotes the degree of association between the feature x^(i) and the target class C.
2. The text classification feature selection method according to claim 1, wherein the step 1 comprises:
step 11: for each feature x^(i) in the feature set S, calculating the degree of association R_c(x^(i)) between the feature x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between the feature x^(i) and the target class C;
step 12: sorting the feature set S in descending order of the degree of association R_c(x^(i));
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the degree of association between the feature x^(i) and the target class C.
3. The text classification feature selection method according to claim 2, wherein I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log( p(x^(i) | c_k) / p(x^(i)) )
where c_k denotes the k-th class of the target class C, p(x^(i), c_k) denotes the probability that the feature x^(i) and the class c_k occur together, p(x^(i) | c_k) denotes the probability that the feature x^(i) occurs in class c_k, and p(x^(i)) denotes the probability that the feature x^(i) occurs in the feature set S.
4. The text classification feature selection method according to claim 1, wherein IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) denotes the mutual information between the feature x^(i) and the target class C; I(x^(j); C) denotes the mutual information between the feature x^(j) and the target class C; and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target class C.
5. The text classification feature selection method according to claim 4, wherein I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log( p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) )
where c_k denotes the k-th class of the target class C, p(x^(i), x^(j), c_k) denotes the probability that the feature x^(i), the feature x^(j) and the class c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that the feature x^(i) and the feature x^(j) occur together in class c_k, and p(x^(i), x^(j)) denotes the probability that the feature x^(i) and the feature x^(j) occur together in the feature set S.
6. The text classification feature selection method according to claim 1, wherein the step 2 comprises:
step 21: adding the first feature in the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature is the one with the largest degree of association R_c(x^(i));
step 22: starting from the second feature in the feature set S, with x^(i) denoting the current feature, calculating the redundancy R_x and the degree of synergy S_x between the feature x^(i) and all features in the candidate set S_sel, and, combining them with the degree of association R_c(x^(i)) between the feature and the target class, calculating the sensitivity Sen(x^(i)) of the feature x^(i);
step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding the feature x^(i) to the candidate set S_sel; otherwise, adding the feature x^(i) to the exclusion set S_exc;
step 24: if x^(i) is the last feature in the feature set S, ending the division; otherwise, setting x^(i) to the next feature in the feature set S and returning to step 22.
7. The text classification feature selection method according to claim 1, wherein the step 3 comprises:
step 31: setting the pending set S_tbd to the empty set, i.e. S_tbd = {}; letting x^(k) be the first feature of the exclusion set S_exc and x^(m) be the first feature of the candidate set S_sel;
step 32: for the feature x^(k) of the exclusion set S_exc, calculating the maximum of the degrees of synergy between the feature x^(m) of the candidate set S_sel and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
step 33: if the maximum degree of synergy of the feature x^(m) is attained with x^(k), adding x^(m) to the pending set S_tbd;
step 34: if the feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if the pending set S_tbd is not empty, letting x^(j) be the first feature of the pending set S_tbd and going to step 35; if the feature x^(m) is not the last feature in the candidate set S_sel, setting the feature x^(m) to the next feature of the candidate set S_sel and returning to step 32;
step 35: for the feature x^(j) of the pending set S_tbd, updating the sensitivity of the feature x^(j) according to the following formula:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the sensitivity Sen(x^(j)) of the feature x^(j) with the preset threshold th; if Sen(x^(j)) < th, removing the feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if the feature x^(j) is the last element of the pending set S_tbd, going directly to step 36; otherwise, setting the feature x^(j) to the next element of the pending set S_tbd and returning to step 35;
step 36: if the feature x^(k) is the last element in the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the result of the final feature selection; otherwise, setting the feature x^(k) to the next element of the exclusion set S_exc and returning to step 31.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710181572.8A CN107016073B (en) | 2017-03-24 | 2017-03-24 | A kind of text classification feature selection approach |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710181572.8A CN107016073B (en) | 2017-03-24 | 2017-03-24 | A kind of text classification feature selection approach |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107016073A CN107016073A (en) | 2017-08-04 |
CN107016073B true CN107016073B (en) | 2019-06-28 |
Family
ID=59445053
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710181572.8A Active CN107016073B (en) | 2017-03-24 | 2017-03-24 | A kind of text classification feature selection approach |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107016073B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934251B (en) * | 2018-12-27 | 2021-08-06 | 国家计算机网络与信息安全管理中心广东分中心 | Method, system and storage medium for recognizing text in Chinese language |
CN111612385B (en) * | 2019-02-22 | 2024-04-16 | 北京京东振世信息技术有限公司 | Method and device for clustering articles to be distributed |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105184323A (en) * | 2015-09-15 | 2015-12-23 | 广州唯品会信息科技有限公司 | Feature selection method and system |
CN105260437A (en) * | 2015-09-30 | 2016-01-20 | 陈一飞 | Text classification feature selection method and application thereof to biomedical text classification |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8473451B1 (en) * | 2004-07-30 | 2013-06-25 | At&T Intellectual Property I, L.P. | Preserving privacy in natural language databases |
-
2017
- 2017-03-24 CN CN201710181572.8A patent/CN107016073B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105184323A (en) * | 2015-09-15 | 2015-12-23 | 广州唯品会信息科技有限公司 | Feature selection method and system |
CN105260437A (en) * | 2015-09-30 | 2016-01-20 | 陈一飞 | Text classification feature selection method and application thereof to biomedical text classification |
Non-Patent Citations (1)
Title |
---|
Research on feature selection in Chinese text classification; Zhou Qian et al.; Journal of Chinese Information Processing (中文信息学报); 2004-12-31; Vol. 18, No. 3; pp. 17-23
Also Published As
Publication number | Publication date |
---|---|
CN107016073A (en) | 2017-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7610282B1 (en) | Rank-adjusted content items | |
US20200057958A1 (en) | Identification and application of hyperparameters for machine learning | |
US7680768B2 (en) | Information processing apparatus and method, program, and storage medium | |
CN108228541B (en) | Method and device for generating document abstract | |
US9256649B2 (en) | Method and system of filtering and recommending documents | |
CN109948125B (en) | Method and system for improved Simhash algorithm in text deduplication | |
US10353925B2 (en) | Document classification device, document classification method, and computer readable medium | |
CN105760526B (en) | A kind of method and apparatus of news category | |
CN111428138A (en) | Course recommendation method, system, equipment and storage medium | |
CN106708929B (en) | Video program searching method and device | |
US10387805B2 (en) | System and method for ranking news feeds | |
WO2022089467A1 (en) | Video data sorting method and apparatus, computer device, and storage medium | |
CN110334356A (en) | Article matter method for determination of amount, article screening technique and corresponding device | |
CN103593425A (en) | Intelligent retrieval method and system based on preference | |
US9269057B1 (en) | Using specialized workers to improve performance in machine learning | |
WO2011130526A1 (en) | Ascribing actionable attributes to data that describes a personal identity | |
CN106570196B (en) | Video program searching method and device | |
CN111680152A (en) | Method and device for extracting abstract of target text, electronic equipment and storage medium | |
CN107016073B (en) | A kind of text classification feature selection approach | |
US11663520B1 (en) | Regularization relaxation scheme | |
CN109716660A (en) | Data compression device and method | |
CN109933691A (en) | Method, apparatus, equipment and storage medium for content retrieval | |
CN107133321B (en) | Method and device for analyzing search characteristics of page | |
KR20130045054A (en) | Keyword extracting and refining system, and method thereof | |
CN112733006B (en) | User portrait generation method, device and equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||