
CN107016073B - Text classification feature selection method - Google Patents

Text classification feature selection method Download PDF

Info

Publication number
CN107016073B
CN107016073B (application CN201710181572.8A)
Authority
CN
China
Prior art keywords
feature
degree
sel
exc
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710181572.8A
Other languages
Chinese (zh)
Other versions
CN107016073A (en)
Inventor
张晓彤
余伟伟
刘喆
王璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN201710181572.8A priority Critical patent/CN107016073B/en
Publication of CN107016073A publication Critical patent/CN107016073A/en
Application granted granted Critical
Publication of CN107016073B publication Critical patent/CN107016073B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text classification feature selection method that can reduce feature dimensionality and classification complexity while improving classification accuracy. The method includes: obtaining a feature set S and a target class C, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sorting S in descending order of R_c(x^(i)); calculating the redundancy R_x and the degree of synergy S_x between every two features in S, combining them with the degree of association R_c(x^(i)) between each feature and the target class to compute the feature's sensitivity Sen, comparing Sen with a preset threshold th and, using the descending ordering of S, dividing S into a candidate set S_sel and an exclusion set S_exc according to th; calculating the sensitivity Sen between the features in S_sel and S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th. The present invention is applicable to the field of machine learning text classification.

Description

Text classification feature selection method
Technical Field
The invention relates to the field of machine learning text classification, in particular to a text classification feature selection method.
Background
With the continuous expansion of the Internet, the information resources gathered on it keep growing. Content-based information retrieval and data mining have therefore attracted attention as ways to manage and use these resources effectively. Text classification is an important basis for information retrieval and text data mining; its main task is to assign documents of unknown class to one or more predefined classes according to their content. However, the large number of training samples and the high dimensionality of the feature vectors make text classification a machine learning problem with high time and space complexity. Feature selection is therefore needed to reduce the feature dimensionality while preserving classification performance as far as possible.
Feature selection is an important data preprocessing step. Among the common feature selection methods for text classification, the Chi-Square test establishes a null hypothesis that a word is unrelated to the target classes and selects as features the words that deviate most from this hypothesis. It only counts whether a word appears in a document, regardless of how often, which biases it toward low-frequency words. Mutual Information methods select features by the amount of information that the presence of a word contributes about the target class, but they only consider the association between words and the target classes and ignore possible dependencies among the words themselves. The TF-IDF (Term Frequency-Inverse Document Frequency) method evaluates the importance of a word by combining its frequency within a document with its distribution across all documents, and selects features accordingly; however, it simply assumes that words with low document frequency are more important and words with high document frequency are less useful, so its accuracy is limited. Other feature selection methods such as information gain, odds ratio, weight of evidence for text and expected cross entropy have also been proposed, but most of them consider only the association between words and the target classes, or only the correlation among words, which easily leads to insufficient dimensionality reduction or low classification accuracy.
Disclosure of Invention
The invention aims to provide a text classification feature selection method to solve the problems of high feature dimension or low classification precision in the prior art.
In order to solve the above technical problem, an embodiment of the present invention provides a text classification feature selection method, including:
Step 1: acquire a feature set S and a target class C, calculate the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sort the feature set S in descending order of R_c(x^(i));
Step 2: calculate the redundancy R_x and the degree of synergy S_x between every two features in the feature set S, combine them with the degree of association R_c(x^(i)) between each feature and the target class to compute the feature's sensitivity Sen, compare Sen with a preset threshold th and, using the descending ordering of the feature set S, divide S into a candidate set S_sel and an exclusion set S_exc according to th;
Step 3: compute the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
Further, the step 1 comprises:
Step 11: for each feature x^(i) in the feature set S, compute the degree of association R_c(x^(i)) between x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) is the mutual information between feature x^(i) and the target class C;
Step 12: sort the feature set S in descending order of R_c(x^(i));
where x^(i) denotes the ith feature in the feature set S and R_c(x^(i)) denotes the degree of association between feature x^(i) and the target class C.
Further, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where c_k denotes the kth class of the target class C, p(x^(i), c_k) denotes the probability that feature x^(i) and class c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) occurs within class c_k, and p(x^(i)) denotes the probability that feature x^(i) occurs in the feature set S.
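For illustration (not part of the patent text), the Python sketch below computes this mutual-information relevance from a term-count matrix; the function name and the simple count-based probability estimates are assumptions made for the example, not the exact approximations prescribed later in the embodiment.

```python
import numpy as np

def relevance(X, y, i):
    """Sketch of R_c(x^(i)) = I(x^(i); C).

    X : (n_files, n_features) term-count matrix; column i holds the per-file
        occurrence counts of the word corresponding to feature x^(i).
    y : (n_files,) array with the class label of each file.
    """
    counts_i = X[:, i].astype(float)
    if counts_i.sum() == 0:
        return 0.0
    p_xi = counts_i.sum() / X.sum()        # p(x^(i)): share of word i among all word occurrences
    mi = 0.0
    for ck in np.unique(y):
        in_class = (y == ck)
        n_ik = counts_i[in_class].sum()    # occurrences of word i in c_k-class files
        if n_ik == 0:
            continue
        p_joint = n_ik / X.sum()           # p(x^(i), c_k)
        p_cond = n_ik / X[in_class].sum()  # p(x^(i) | c_k)
        mi += p_joint * np.log(p_cond / p_xi)
    return mi

# Step 1 then ranks the features by descending relevance:
# order = np.argsort([-relevance(X, y, i) for i in range(X.shape[1])])
```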
Further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the ith feature x^(i) and the jth feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy between feature x^(i) and feature x^(j), i.e. the smaller of 0 and the correlation gain.
Further, the degree of synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the ith feature x^(i) and the jth feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the degree of synergy between feature x^(i) and feature x^(j), i.e. the larger of 0 and the correlation gain.
Further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target class C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target class C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target class C.
Further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the kth class of the target class C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i), x^(j) and class c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that features x^(i) and x^(j) occur together within class c_k, and p(x^(i), x^(j)) denotes the probability that features x^(i) and x^(j) occur together in the feature set S.
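Continuing the illustration (again an assumption-laden sketch, not the patent's reference implementation), the correlation gain and the derived redundancy and synergy can be computed with the `relevance` function from the previous sketch; the min-of-counts co-occurrence convention used here is the one described in the detailed embodiment below.

```python
import numpy as np

def pair_mutual_info(X, y, i, j):
    """Sketch of I((x^(i), x^(j)); C); the per-file co-occurrence count of
    words i and j is taken as min(count_i, count_j)."""
    co = np.minimum(X[:, i], X[:, j]).astype(float)
    if co.sum() == 0:
        return 0.0
    p_xij = co.sum() / X.sum()
    mi = 0.0
    for ck in np.unique(y):
        in_class = (y == ck)
        n_ijk = co[in_class].sum()
        if n_ijk == 0:
            continue
        p_joint = n_ijk / X.sum()            # p(x^(i), x^(j), c_k)
        p_cond = n_ijk / X[in_class].sum()   # p((x^(i), x^(j)) | c_k)
        mi += p_joint * np.log(p_cond / p_xij)
    return mi

def correlation_gain(X, y, i, j):
    # IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
    return pair_mutual_info(X, y, i, j) - relevance(X, y, i) - relevance(X, y, j)

def redundancy(X, y, i, j):
    # R_x: negative correlation gain means the two features overlap
    return min(0.0, correlation_gain(X, y, i, j))

def synergy(X, y, i, j):
    # S_x: positive correlation gain means the pair is more informative together
    return max(0.0, correlation_gain(X, y, i, j))
```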
Further, the step 2 comprises:
Step 21: add the first feature of the feature set S to the candidate set S_sel and set the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature is the one with the largest degree of association R_c(x^(i));
Step 22: starting from the second feature of the feature set S, and denoting the current feature by x^(i), calculate the redundancy R_x and the degree of synergy S_x between feature x^(i) and every feature in the candidate set S_sel, and combine them with the degree of association R_c(x^(i)) between the feature and the target class to compute the sensitivity Sen(x^(i)) of feature x^(i);
Step 23: compare the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add feature x^(i) to the candidate set S_sel; otherwise, add feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, the partition ends; otherwise, set x^(i) to the next feature in S and return to step 22.
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the degree of synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum degree of synergy between feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target class C, and R_c(x^(i)) denotes the degree of association between feature x^(i) and the target class C.
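A minimal sketch of the sensitivity computation and of the partition in steps 21-24 (illustrative only; `relevance`, `redundancy` and `synergy` are the sketches above, and, following step 22, the minimum and maximum are taken over the current candidate set):

```python
def sensitivity(X, y, i, others, alpha=0.5, beta=0.5):
    """Sen(x^(i)) = R_c(x^(i)) + alpha*min(R_x) + beta*max(S_x) over `others`."""
    r_c = relevance(X, y, i)
    if not others:
        return r_c
    red = min(redundancy(X, y, i, j) for j in others)
    syn = max(synergy(X, y, i, j) for j in others)
    return r_c + alpha * red + beta * syn

def partition(X, y, order, th=0.01, alpha=0.5, beta=0.5):
    """Steps 21-24: walk the features in descending order of relevance and
    split them into a candidate set and an exclusion set."""
    sel, exc = [order[0]], []              # step 21: most relevant feature seeds the candidate set
    for i in order[1:]:                    # steps 22-24
        if sensitivity(X, y, i, sel, alpha, beta) > th:
            sel.append(i)
        else:
            exc.append(i)
    return sel, exc
```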
Further, the step 3 comprises:
Step 31: let the pending set S_tbd be empty, i.e. S_tbd = {}; let x^(k) be the first feature of the exclusion set S_exc and x^(m) the first feature of the candidate set S_sel;
Step 32: for the feature x^(k) of the exclusion set S_exc, compute the maximum degree of synergy between the feature x^(m) of the candidate set S_sel and all features of the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the maximum degree of synergy of feature x^(m) is attained at x^(k), add x^(m) to the pending set S_tbd;
Step 34: if feature x^(m) is the last feature of the candidate set S_sel and the pending set S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature of S_tbd and go to step 35; if feature x^(m) is not the last feature of the candidate set S_sel, set x^(m) to the next feature of S_sel and return to step 32;
Step 35: for the feature x^(j) of the pending set S_tbd, update the sensitivity of x^(j) according to the following formula:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare the updated sensitivity Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th, remove feature x^(k) from the exclusion set S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if feature x^(j) is the last element of the pending set S_tbd, go directly to step 36; otherwise, set x^(j) to the next element of S_tbd and return to step 35;
Step 36: if feature x^(k) is the last element of the exclusion set S_exc, return the current candidate set S_sel and exclusion set S_exc as the final feature selection result; otherwise, set x^(k) to the next element of S_exc and return to step 31.
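The adjustment of steps 31-36 can be sketched as follows (illustrative only, reusing the functions above; treating the rule as "move x^(k) back to the candidate set when the updated sensitivity of a dependent candidate drops below th" is a reading of the steps, not the patent's literal flow chart):

```python
def adjust(X, y, sel, exc, th=0.01, alpha=0.5, beta=0.5):
    """Steps 31-36: revisit each excluded feature x^(k) and move it back to the
    candidate set if some candidate feature relies on it for its synergy."""
    all_feats = sel + exc
    for k in list(exc):                                    # steps 31 / 36: iterate over the exclusion set
        tbd = []
        for m in sel:                                      # steps 32-34: candidates whose maximum
            others = [i for i in all_feats if i != m]      # synergy partner is x^(k)
            best = max(others, key=lambda i: synergy(X, y, m, i))
            if best == k:
                tbd.append(m)
        for j in tbd:                                      # step 35: recompute sensitivity without x^(k)
            others = [n for n in all_feats if n != j and n != k]
            if sensitivity(X, y, j, others, alpha, beta) < th:
                exc.remove(k)                              # x^(k) is needed after all
                sel.append(k)
                break
    return sel, exc                                        # step 36: final selection result
```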
The technical scheme of the invention has the following beneficial effects:
In this scheme, the degree of association R_c(x^(i)) between each feature and the target class and the redundancy R_x and degree of synergy S_x between features are calculated from the feature set S and the target class C, and from these the sensitivity Sen of each feature is obtained; the features are screened against a preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the two sets are continuously adjusted and optimized in the subsequent process. In this way, the interrelations between the features and the target class and among the features themselves are considered comprehensively, the features are selected by degree of association, redundancy and synergy, the features that play a key role in classification are retained, the feature dimensionality and classification complexity are reduced, and the classification accuracy can be improved.
Drawings
Fig. 1 is a schematic flowchart of a text classification feature selection method according to an embodiment of the present invention;
fig. 2 is a detailed flowchart of a text classification feature selection method according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of dividing the feature set into a candidate set and an exclusion set in the feature selection method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of adjusting the candidate set and the exclusion set in the feature selection method according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a text classification feature selection method aiming at the problems of high feature dimension or low classification precision in the prior art.
As shown in fig. 1, the method for selecting text classification features provided in the embodiment of the present invention includes:
Step 1: acquire a feature set S and a target class C, calculate the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sort the feature set S in descending order of R_c(x^(i));
Step 2: calculate the redundancy R_x and the degree of synergy S_x between every two features in the feature set S, combine them with the degree of association R_c(x^(i)) between each feature and the target class to compute the feature's sensitivity Sen, compare Sen with a preset threshold th and, using the descending ordering of the feature set S, divide S into a candidate set S_sel and an exclusion set S_exc according to th;
Step 3: compute the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
The text classification feature selection method provided by the embodiment of the present invention calculates, from the feature set S and the target class C, the degree of association R_c(x^(i)) between each feature and the target class and the redundancy R_x and degree of synergy S_x between features, and from these obtains the sensitivity Sen of each feature; the features are screened against a preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the two sets are continuously adjusted and optimized in the subsequent process. In this way, the interrelations between the features and the target class and among the features themselves are considered comprehensively, the features are selected by degree of association, redundancy and synergy, the features that play a key role in classification are retained, the feature dimensionality and classification complexity are reduced, and the classification accuracy can be improved.
In this embodiment, as shown in fig. 2, to acquire the feature set S and the target class C, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target class C are first taken as input.
In this embodiment, the feature set S is the set of all features (word vectors, a single feature being denoted x^(i)) used in the text classification process, i.e. S = (x^(1), x^(2), ..., x^(n)), where n is the number of features in S; a feature x^(i) is the column vector formed by the number of occurrences of its corresponding word in each text file; the target class C is the column vector formed by the class of each text file, i.e. C is the set of class labels.
In this embodiment, the degree of association R_c(x^(i)) between feature x^(i) and the target class C is the mutual information between feature x^(i) and the target class C.
In this embodiment, as an optional implementation, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C and sorting the feature set S in descending order of R_c(x^(i)) (step 1) includes:
Step 11: for each feature x^(i) in the feature set S, compute the degree of association R_c(x^(i)) between x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) is the mutual information between feature x^(i) and the target class C;
Step 12: sort the feature set S in descending order of R_c(x^(i));
where x^(i) denotes the ith feature in the feature set S and R_c(x^(i)) denotes the degree of association between feature x^(i) and the target class C.
In this embodiment, the mutual information is
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target class C, c_k denotes the kth class of the target class C, p(x^(i), c_k) denotes the probability that feature x^(i) and class c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) occurs within class c_k, and p(x^(i)) denotes the probability that feature x^(i) occurs in the feature set S.
In this embodiment, preferably, the probability p(x^(i), c_k) that feature x^(i) and class c_k occur together is approximated by the frequency with which the word corresponding to x^(i) occurs in c_k-class files relative to its occurrences in all files, where x^(i)_j denotes the jth element of feature x^(i) (i.e. the number of times the corresponding word appears in the jth file) and x^(i)_{c_k,m} denotes the mth element of x^(i) whose target class is c_k (i.e. the number of times the corresponding word appears in the mth c_k-class file).
In this embodiment, preferably, the probability p(x^(i) | c_k) that feature x^(i) occurs within class c_k is approximated by the frequency of occurrence of the corresponding word in the c_k-class files.
In this embodiment, preferably, the probability p(x^(i)) that feature x^(i) occurs in the feature set S is approximated by the frequency of occurrence of the corresponding word in all files.
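As a concrete illustration of these quantities (not from the patent; the toy corpus, labels and use of scikit-learn's CountVectorizer are assumptions for the example, and Chinese text would additionally need word segmentation), a term-count matrix can be built so that each column is one feature vector x^(i) and the label array plays the role of the target class C:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each document is one "text file", each label its target class.
docs = [
    "stock market prices rise",
    "market shares fall on stock news",
    "team wins the football match",
    "football season starts with a big match",
]
labels = np.array(["finance", "finance", "sports", "sports"])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()   # X[j, i] = count of word i in file j
words = vectorizer.get_feature_names_out()

# Column i of X is the vector x^(i); summing its entries over the files of one
# class gives the class-wise counts used in the frequency approximations above.
i = list(words).index("market")
for ck in np.unique(labels):
    print(ck, X[labels == ck, i].sum(), "occurrences of 'market'")
```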
in this embodiment, as a further alternative, as shown in fig. 3, the redundancy R between every two features in the feature set S is calculatedxAnd degree of synergy SxAssociation degree R between combined features and object classesc(x(i)) Calculating the sensitivity Sen of the feature, comparing the sensitivity Sen with a preset threshold th, and dividing the feature set S into candidate sets S according to the threshold thselAnd exclusion set Sexc(step 2) comprising:
step 21: adding the first feature in the feature set S to the candidate set SselWill exclude the set SexcSet to an empty set, i.e. Ssel={x(1)},Sexc-the first feature corresponds to a degree of association R { }c(x(i)) Maximum;
step 22: starting with the second feature in the feature set S, with x(i)Representing said second feature, calculating feature x(i)And candidate set SselRedundancy R between all features inxAnd degree of synergy SxAnd combining the relevance R between the characteristics and the target categoryc(x(i)) Computing feature x(i)Sensitivity Sen (x) of (2)(i));
Step 23: sensitivity Sen (x)(i)) Comparing with a preset threshold th, if Sen (x)(i)) > th, then the feature x(i)Add candidate set Ssel(ii) a Else, the feature x(i)Addition of exclusion set Sexc
Step 24: if x(i)If the last feature in the feature set S is adopted, the division is ended; otherwise, x is(i)Is arranged asThe next feature in the set S is returned to step 22.
In an embodiment of the foregoing text classification feature selection method, further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the ith feature x^(i) and the jth feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy between feature x^(i) and feature x^(j), i.e. the smaller of 0 and the correlation gain.
In an embodiment of the foregoing text classification feature selection method, further, the degree of synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the ith feature x^(i) and the jth feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the degree of synergy between feature x^(i) and feature x^(j), i.e. the larger of 0 and the correlation gain.
In an embodiment of the foregoing text classification feature selection method, further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
where I(x^(i); C) and I(x^(j); C) are computed with the same mutual-information formula as above: I(x^(i); C) denotes the mutual information between feature x^(i) and the target class C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target class C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target class C.
In an embodiment of the foregoing text classification feature selection method, further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the kth class of the target class C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i), x^(j) and class c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that features x^(i) and x^(j) occur together within class c_k, and p(x^(i), x^(j)) denotes the probability that features x^(i) and x^(j) occur together in the feature set S.
In this embodiment, preferably, the probability p(x^(i), x^(j), c_k) that features x^(i), x^(j) and class c_k occur together is approximated by the frequency with which the corresponding words occur together in c_k-class files relative to all files, where the co-occurrence count of the two words in the mth c_k-class file is taken as the smaller of their respective occurrence counts in that file.
In this embodiment, preferably, the probability p((x^(i), x^(j)) | c_k) that features x^(i) and x^(j) occur together within class c_k is approximated by the frequency with which the corresponding words occur together in the c_k-class files.
In this embodiment, preferably, the probability p(x^(i), x^(j)) that features x^(i) and x^(j) occur together in the feature set S is approximated by the frequency with which the corresponding words occur together in all files.
in an embodiment of the foregoing text classification feature selection method, further, the sensitivity Sen (x)(i)) Expressed as:
Sen(x(i))=Rc(x(i))+αmin(Rx(x(i);x(j)))
+βmax(Sx(x(i);x(j))),j≠i
wherein α and β are redundancies R, respectivelyxAnd degree of synergy SxWeight of (3), min (R)x(x(i);x(j)) ) represents a feature x(i)Minimum of redundancy with the remaining features, max (S)x(x(i);x(j)) ) represents a feature x(i)Maximum value of degree of agreement with the remaining features, Sen (x)(i)) Represents a feature x(i)Sensitivity to target class C, Rc(x(i)) Represents a feature x(i)Degree of association with the target class C.
In this embodiment, as shown in fig. 4, as an optional implementation, computing the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th (step 3) includes:
Step 31: let the pending set S_tbd be empty, i.e. S_tbd = {}; let x^(k) be the first feature of the exclusion set S_exc and x^(m) the first feature of the candidate set S_sel;
Step 32: for the feature x^(k) of the exclusion set S_exc, compute the maximum degree of synergy between the feature x^(m) of the candidate set S_sel and all features of the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the maximum degree of synergy of feature x^(m) is attained at x^(k), add x^(m) to the pending set S_tbd;
Step 34: if feature x^(m) is the last feature of the candidate set S_sel and the pending set S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature of S_tbd and go to step 35; if feature x^(m) is not the last feature of the candidate set S_sel, set x^(m) to the next feature of S_sel and return to step 32;
Step 35: for the feature x^(j) of the pending set S_tbd, update the sensitivity of x^(j) according to the following formula:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare the updated sensitivity Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th, remove feature x^(k) from the exclusion set S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if feature x^(j) is the last element of the pending set S_tbd, go directly to step 36; otherwise, set x^(j) to the next element of S_tbd and return to step 35;
Step 36: if feature x^(k) is the last element of the exclusion set S_exc, return the current candidate set S_sel and exclusion set S_exc as the final feature selection result; otherwise, set x^(k) to the next element of S_exc and return to step 31.
In this embodiment, according to steps 31-36, the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc is computed and compared with the preset threshold th, and S_sel and S_exc are adjusted according to th to obtain a new candidate set S_sel and exclusion set S_exc; this reduces the influence that removing or adding individual features has on the classification result.
In this embodiment, the weight α of the redundancy R_x may default to 0.5, the weight β of the degree of synergy S_x may default to 0.5, and the preset threshold th may default to 0.01; the weights α and β and the threshold th are further optimized and updated by a genetic algorithm in the subsequent training and testing processes.
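Putting the sketches together, an end-to-end run with these default parameters might look as follows (illustrative only; `relevance`, `partition` and `adjust` are the assumed helper functions from the earlier sketches):

```python
import numpy as np

def select_features(X, y, th=0.01, alpha=0.5, beta=0.5):
    """End-to-end sketch: rank by relevance (step 1), partition the feature set
    (step 2), then adjust the candidate and exclusion sets (step 3)."""
    rel = np.array([relevance(X, y, i) for i in range(X.shape[1])])
    order = list(np.argsort(-rel))                        # step 1: descending degree of association
    sel, exc = partition(X, y, order, th, alpha, beta)    # step 2
    return adjust(X, y, sel, exc, th, alpha, beta)        # step 3

# With the toy term-count matrix X and labels from the earlier example:
# sel, exc = select_features(X, labels)
# print("selected words:", [words[i] for i in sel])
```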
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A text classification feature selection method, comprising:
Step 1: acquiring a feature set S and a target class C, calculating the degree of association R_c(x^(i)) between each feature x^(i) in the feature set S and the target class C, and sorting the feature set S in descending order of R_c(x^(i));
Step 2: calculating the redundancy R_x and the degree of synergy S_x between every two features in the feature set S, combining them with the degree of association R_c(x^(i)) between each feature and the target class to compute the feature's sensitivity Sen, comparing Sen with a preset threshold th and, using the descending ordering of the feature set S, dividing S into a candidate set S_sel and an exclusion set S_exc according to th;
Step 3: computing the sensitivity Sen between the features in the candidate set S_sel and the exclusion set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th;
wherein the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the ith feature x^(i) and the jth feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy between feature x^(i) and feature x^(j), i.e. the smaller of 0 and the correlation gain;
wherein the degree of synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the ith feature x^(i) and the jth feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the degree of synergy between feature x^(i) and feature x^(j), i.e. the larger of 0 and the correlation gain;
wherein the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the degree of synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum degree of synergy between feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target class C, and R_c(x^(i)) denotes the degree of association between feature x^(i) and the target class C.
2. The text classification feature selection method according to claim 1, wherein the step 1 comprises:
Step 11: for each feature x^(i) in the feature set S, computing the degree of association R_c(x^(i)) between x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) is the mutual information between feature x^(i) and the target class C;
Step 12: sorting the feature set S in descending order of R_c(x^(i));
where x^(i) denotes the ith feature in the feature set S and R_c(x^(i)) denotes the degree of association between feature x^(i) and the target class C.
3. The text classification feature selection method of claim 2, wherein I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where c_k denotes the kth class of the target class C, p(x^(i), c_k) denotes the probability that feature x^(i) and class c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) occurs within class c_k, and p(x^(i)) denotes the probability that feature x^(i) occurs in the feature set S.
4. The text classification feature selection method of claim 1, wherein IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target class C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target class C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target class C.
5. The method of claim 4, wherein I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the kth class of the target class C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i), x^(j) and class c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that features x^(i) and x^(j) occur together within class c_k, and p(x^(i), x^(j)) denotes the probability that features x^(i) and x^(j) occur together in the feature set S.
6. The text classification feature selection method according to claim 1, wherein the step 2 includes:
Step 21: adding the first feature of the feature set S to the candidate set S_sel and setting the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest degree of association R_c(x^(i));
Step 22: starting from the second feature of the feature set S, and denoting the current feature by x^(i), calculating the redundancy R_x and the degree of synergy S_x between feature x^(i) and every feature in the candidate set S_sel, and combining them with the degree of association R_c(x^(i)) between the feature and the target class to compute the sensitivity Sen(x^(i)) of feature x^(i);
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding feature x^(i) to the candidate set S_sel; otherwise, adding feature x^(i) to the exclusion set S_exc;
Step 24: if x^(i) is the last feature in the feature set S, ending the partition; otherwise, setting x^(i) to the next feature in S and returning to step 22.
7. The text classification feature selection method according to claim 1, wherein the step 3 comprises:
Step 31: letting the pending set S_tbd be empty, i.e. S_tbd = {}, letting x^(k) be the first feature of the exclusion set S_exc, and letting x^(m) be the first feature of the candidate set S_sel;
Step 32: for the feature x^(k) of the exclusion set S_exc, computing the maximum degree of synergy between the feature x^(m) of the candidate set S_sel and all features of the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the maximum degree of synergy of feature x^(m) is attained at x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if feature x^(m) is the last feature of the candidate set S_sel and the pending set S_tbd is empty, going to step 36; if S_tbd is not empty, letting x^(j) be the first feature of S_tbd and going to step 35; if feature x^(m) is not the last feature of the candidate set S_sel, setting x^(m) to the next feature of S_sel and returning to step 32;
Step 35: for the feature x^(j) of the pending set S_tbd, updating the sensitivity of x^(j) according to the following formula:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing the updated sensitivity Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th, removing feature x^(k) from the exclusion set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if feature x^(j) is the last element of the pending set S_tbd, going directly to step 36; otherwise, setting x^(j) to the next element of S_tbd and returning to step 35;
Step 36: if feature x^(k) is the last element of the exclusion set S_exc, returning the current candidate set S_sel and exclusion set S_exc as the final feature selection result; otherwise, setting x^(k) to the next element of S_exc and returning to step 31.
CN201710181572.8A 2017-03-24 2017-03-24 A kind of text classification feature selection approach Active CN107016073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) 2017-03-24 2017-03-24 A kind of text classification feature selection approach

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) 2017-03-24 2017-03-24 A kind of text classification feature selection approach

Publications (2)

Publication Number Publication Date
CN107016073A CN107016073A (en) 2017-08-04
CN107016073B true CN107016073B (en) 2019-06-28

Family

ID=59445053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181572.8A Active CN107016073B (en) 2017-03-24 2017-03-24 A kind of text classification feature selection approach

Country Status (1)

Country Link
CN (1) CN107016073B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934251B (en) * 2018-12-27 2021-08-06 国家计算机网络与信息安全管理中心广东分中心 Method, system and storage medium for recognizing text in Chinese language
CN111612385B (en) * 2019-02-22 2024-04-16 北京京东振世信息技术有限公司 Method and device for clustering articles to be distributed

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8473451B1 (en) * 2004-07-30 2013-06-25 At&T Intellectual Property I, L.P. Preserving privacy in natural language databases

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Feature Selection in Chinese Text Classification; Zhou Qian et al.; Journal of Chinese Information Processing (《中文信息学报》); 2004-12-31; Vol. 18, No. 3, pp. 17-23

Also Published As

Publication number Publication date
CN107016073A (en) 2017-08-04

Similar Documents

Publication Publication Date Title
US7610282B1 (en) Rank-adjusted content items
US20200057958A1 (en) Identification and application of hyperparameters for machine learning
US7680768B2 (en) Information processing apparatus and method, program, and storage medium
CN108228541B (en) Method and device for generating document abstract
US9256649B2 (en) Method and system of filtering and recommending documents
CN109948125B (en) Method and system for improved Simhash algorithm in text deduplication
US10353925B2 (en) Document classification device, document classification method, and computer readable medium
CN105760526B (en) A kind of method and apparatus of news category
CN111428138A (en) Course recommendation method, system, equipment and storage medium
CN106708929B (en) Video program searching method and device
US10387805B2 (en) System and method for ranking news feeds
WO2022089467A1 (en) Video data sorting method and apparatus, computer device, and storage medium
CN110334356A (en) Article matter method for determination of amount, article screening technique and corresponding device
CN103593425A (en) Intelligent retrieval method and system based on preference
US9269057B1 (en) Using specialized workers to improve performance in machine learning
WO2011130526A1 (en) Ascribing actionable attributes to data that describes a personal identity
CN106570196B (en) Video program searching method and device
CN111680152A (en) Method and device for extracting abstract of target text, electronic equipment and storage medium
CN107016073B (en) A kind of text classification feature selection approach
US11663520B1 (en) Regularization relaxation scheme
CN109716660A (en) Data compression device and method
CN109933691A (en) Method, apparatus, equipment and storage medium for content retrieval
CN107133321B (en) Method and device for analyzing search characteristics of page
KR20130045054A (en) Keyword extracting and refining system, and method thereof
CN112733006B (en) User portrait generation method, device and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant