CN112418293A - Active learning sampling method based on information degree and representativeness - Google Patents
Active learning sampling method based on information degree and representativeness
- Publication number
- CN112418293A (application number CN202011296097.7A)
- Authority
- CN
- China
- Prior art keywords
- degree
- sample
- sampling
- active learning
- unlabeled
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention discloses an active learning sampling method based on information degree and representativeness, which comprises the following steps: 1) for each multivariate time series in the unlabeled data set, calculating its information degree and representativeness; 2) based on the information degree and representativeness calculated in step 1), selecting the most valuable unlabeled sample with a sampling algorithm; 3) labeling the sample selected in step 2) and adding it to the labeled data set; 4) judging whether the stopping criterion is satisfied and, once it is, obtaining the updated labeled data set. For the active learning problem on multivariate time series, the invention provides an effective sampling algorithm for selecting unlabeled time-series samples: a bi-objective sampling algorithm combines information degree and representativeness, so that the number of unlabeled samples that must be labeled is effectively reduced while accuracy is maintained.
Description
Technical Field
The invention relates to data mining technology, and in particular to an active learning sampling method based on information degree and representativeness.
Background
Large-scale, diverse labeled training data is crucial for training high-quality classification models. In practical applications, however, labeled data is usually scarce while unlabeled data is abundant, and labeling all the data is time-consuming and expensive. To save human effort and cost, active learning has become a research hotspot. Sampling unlabeled examples is the key step in active learning, and many algorithms have been proposed for it.
To obtain high-quality labeled training data, an active learning sampling strategy seeks the most valuable samples in the unlabeled data; uncertainty-based sampling is a common approach. Lughofer et al., working with the generalized Takagi-Sugeno fuzzy model, proposed two uncertainty-based sampling criteria. Because purely uncertainty-based sampling may select outliers, density-based sampling methods have also been investigated. Mohamad et al. proposed a density-based criterion for dynamic data streams that corrects the sampling bias by weighting samples so as to reflect the true underlying distribution.
To further improve the quality of labeled training data, some research aims at improving its diversity. To reflect the diversity involved in multi-instance active learning, Wang et al. proposed two diversity criteria, one based on clustering and one based on fuzzy rough sets. Other studies combine several criteria to improve sampling quality: He et al. proposed a sampling strategy that ranks samples by uncertainty and local data density, and, seeking a general method for selecting the most suitable samples, Du et al. proposed a measure that combines information degree and representativeness.
The above studies address the sampling problem of active learning, but existing methods do not perform well on high-dimensional data, especially multivariate time series. Time-series data is widely used in medicine, industry, commerce, the military and other fields, and multivariate time series have high dimensionality, so active learning on multivariate time series is of great practical significance and value.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an active learning sampling method based on information degree and representativeness that overcomes the defects of the prior art.
The technical scheme adopted by the invention to solve this problem is as follows: an active learning sampling method based on information degree and representativeness, comprising the following steps:
1) for each multivariate time series in the unlabeled data set, acquiring its information degree and representativeness;
2) based on the information degree and representativeness calculated in step 1), obtaining the most valuable unlabeled sample through a sampling algorithm based on a linear weighted sum or a sampling algorithm based on indexes;
3) labeling the unlabeled sample obtained in step 2) and adding the labeled sample to the labeled data set;
4) judging whether the stopping criterion is satisfied and, once it is, obtaining the updated labeled data set.
According to the scheme, in step 1) the information degree and representativeness of each time series are obtained as follows:
calculate, for each time series, its uncertainty, local spatial density, maximum mean discrepancy and global alignment kernel;
calculate the information degree from the uncertainty and the local spatial density, and the representativeness from the maximum mean discrepancy and the global alignment kernel.
According to the scheme, in step 1) the information degree is calculated from the uncertainty and the local spatial density, specifically as follows:
1.1) for an unlabeled multivariate time-series sample U, find its nearest-neighbor positive sample U_p and nearest-neighbor negative sample U_n using similarity based on dynamic time warping (DTW); from the DTW similarities of U to U_p and U_n, calculate the two parameters P_U and N_U, and then calculate the uncertainty UCTI of sample U with the information-entropy formula:
UCTI(U) = -(P_U log P_U + N_U log N_U)
wherein
P_U = DSim(U, U_p) / (DSim(U, U_p) + DSim(U, U_n)),  N_U = DSim(U, U_n) / (DSim(U, U_p) + DSim(U, U_n))
and U_p is the nearest positive neighbor of U, U_n is the nearest negative neighbor of U, DSim(U, U_p) is the DTW-based similarity of U and U_p, and DSim(U, U_n) is the DTW-based similarity of U and U_n.
1.2) calculate the k nearest neighbors of sample U over the whole data set, count the reverse k-nearest neighbors of each point in the union of U and its k nearest neighbors, and average these counts to obtain the local spatial density of U;
1.3) calculate the information degree of sample U from the uncertainty and the local spatial density.
According to the scheme, the local spatial density LSD in step 1.2) is obtained by averaging the reverse k-nearest-neighbor counts of the points in the union of sample U and its k nearest neighbors, and is calculated as follows:
LSD(U, K) = (1 / (K + 1)) · Σ_{X ∈ {U} ∪ kNN(U)} |RkNN(X)|
where K is the number of nearest neighbors, kNN(U) is the set of K nearest neighbors of U, RkNN(X) is the set of reverse K-nearest neighbors of X, and |RkNN(X)| is its size.
According to the scheme, the information degree INFO of the unlabeled multivariate time series U in step 1.3) is calculated with the following formula:
INFO(U) = UCTI(U) · LSD(U, K) = UCTI(U) · (|RkNN(U)| + Σ_{X ∈ kNN(U)} |RkNN(X)|) / (K + 1)
where LSD(U, K) is the local spatial density of U, UCTI(U) is the uncertainty of U, kNN(U) is the set of K nearest neighbors of U, |RkNN(X)| is the number of reverse K-nearest neighbors of X, and |RkNN(U)| is the number of reverse K-nearest neighbors of U.
According to the scheme, the representativeness of the unlabeled multivariate time-series sample U in step 1) is calculated as follows: compute a global alignment kernel between two time series from their state-transition equation and initial state, and use this kernel as the kernel of the maximum mean discrepancy to obtain the representativeness of sample U.
According to the scheme, the sampling algorithm in step 2) is based on a linear weighted sum, specifically as follows: sample ranking based on information degree and representativeness is converted from a bi-objective optimization problem into a single-objective one by introducing a parameter α that balances the two criteria; all unlabeled samples are ranked and the best one is selected, with the objective equation
X* = argmax_{X ∈ Ū} [ α · INFO(X) + (1 − α) · REP(X) ]
where Ū is the unlabeled data set, the parameter α weighs information degree against representativeness, INFO(X) is the information degree of X, and REP(X) is the representativeness of X.
According to the scheme, the sampling algorithm in step 2) is index-based, specifically as follows: calculate two comparative indexes, use a random-ordering technique to balance the search bias between the indexes, and introduce a parameter β to control their relative weight. The two comparative indexes I_1(X) and I_2(X) are calculated from the ordering of the samples,
where "Y prefixes X" indicates that Y precedes X in the ordering.
According to the scheme, step 3) further comprises assigning the reverse nearest neighbors of the unlabeled sample obtained in step 2) to the same class as that sample, and adding them to the labeled data set after labeling.
According to the scheme, the stopping criterion in step 4) is defined over a stable interval φ: if, within φ, the maximum difference between the utility values of the unlabeled samples selected in successive sampling iterations is smaller than a threshold, the criterion is satisfied and the iteration stops; the utility of an unlabeled sample is the value computed by the sampling algorithm of step 2).
The invention has the following beneficial effects:
For the active learning problem on multivariate time series, the invention provides an effective sampling algorithm for selecting unlabeled time-series samples. It introduces two sampling criteria, information degree and representativeness, which measure the importance of an unlabeled sample from different angles, and combines them in a bi-objective sampling algorithm, so that the number of unlabeled samples that must be labeled is effectively reduced while accuracy is maintained.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic structural diagram of an embodiment of the present invention;
FIG. 2 is a diagram of the accuracy of nearest-neighbor classification after the linear-weighted-sum sampling algorithm selects different proportions of unlabeled data for labeling;
FIG. 3 is a diagram of the F-measure of nearest-neighbor classification after the linear-weighted-sum sampling algorithm selects different proportions of unlabeled data for labeling;
FIG. 4 is a diagram of the maximum mean discrepancy between the labeled data set and the whole data set after selecting different proportions of unlabeled samples from the WG data set, with and without considering sample representativeness;
FIG. 5 is a diagram of the two-dimensional visualized distributions of the labeled data set and the whole data set after the linear-weighted-sum sampling algorithm, considering information degree and representativeness simultaneously, selects different proportions of unlabeled samples from the WG data set.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is further described in detail through the following embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
The embodiment uses the WG data set as a concrete example to illustrate the effectiveness of the method. The WG data contains 2 categories (labeled as positive and negative examples); each sample consists of time series of 3 variables, each of length 315, and there are 1120 samples in total. In the initial state only one sample is labeled and the rest are unlabeled. To reduce sensitivity to initialization, ten trials were carried out, each starting from different labeled data.
As shown in fig. 1, an active learning sampling method based on information degree and representativeness includes the following steps:
step 1, calculating uncertainty, local space density, maximum mean difference and global distribution kernel function of each time sequence for a plurality of time sequences in an unlabeled data set; and calculating the information degree based on the uncertainty and the local space density, and calculating the representation degree based on the maximum mean difference and the global distribution kernel function. The time series in the invention can be medical data, industrial sensor data, server system monitoring data and the like. If the time sequence is medical data, the method can effectively reduce the sampling number of unlabeled samples under the condition of ensuring the accuracy, and selects the sample with higher research value, thereby being beneficial to disease diagnosis and medical research.
The embodiment computes the information degree and representativeness for each sample in the unlabeled data set. The information degree accounts for both the uncertainty of a sample and its local data density. For an unlabeled multivariate time-series sample U, first find its nearest-neighbor positive sample U_p and nearest-neighbor negative sample U_n using DTW-based similarity; from the DTW similarities of U to U_p and U_n, calculate the two parameters P_U and N_U; then calculate the uncertainty UCTI of sample U with the information-entropy formula:
UCTI(U) = -(P_U log P_U + N_U log N_U)
where U_p is the nearest positive neighbor of U, U_n is the nearest negative neighbor of U, DSim(U, U_p) is the DTW-based similarity of U and U_p, DSim(U, U_n) is the DTW-based similarity of U and U_n, and
P_U = DSim(U, U_p) / (DSim(U, U_p) + DSim(U, U_n)),  N_U = DSim(U, U_n) / (DSim(U, U_p) + DSim(U, U_n)).
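As a concrete illustration, the uncertainty calculation above can be sketched in Python. The patent does not reproduce the exact form of the DTW-based similarity DSim, so this sketch assumes DSim = 1 / (1 + DTW distance) and normalizes the two nearest-neighbor similarities so that P_U + N_U = 1; both choices are illustrative assumptions, not the patent's definitive formulas.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dsim(a, b):
    """DTW-based similarity; the 1/(1+distance) form is an assumption."""
    return 1.0 / (1.0 + dtw_distance(a, b))

def uncertainty(u, pos_samples, neg_samples):
    """UCTI(U) = -(P_U log P_U + N_U log N_U), where P_U and N_U are the
    normalized similarities of U to its nearest positive / negative sample."""
    sp = max(dsim(u, p) for p in pos_samples)  # similarity to nearest positive case
    sn = max(dsim(u, n) for n in neg_samples)  # similarity to nearest negative case
    p_u = sp / (sp + sn)
    n_u = sn / (sp + sn)
    return -(p_u * np.log(p_u) + n_u * np.log(n_u))
```

A sample exactly equidistant (in DTW terms) from its nearest positive and negative neighbors gets the maximum uncertainty log 2, which matches the intent of the entropy formulation.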
The local spatial density LSD is obtained by averaging the reverse k-nearest-neighbor counts of the points in the union of sample U and its k nearest neighbors:
LSD(U, K) = (1 / (K + 1)) · Σ_{X ∈ {U} ∪ kNN(U)} |RkNN(X)|
where K is the number of nearest neighbors, kNN(U) is the set of K nearest neighbors of U, RkNN(X) is the set of reverse K-nearest neighbors of X, and |RkNN(X)| is its size.
Based on the uncertainty and the local spatial density, the information degree INFO of an unlabeled multivariate time series U is calculated as:
INFO(U) = UCTI(U) · LSD(U, K) = UCTI(U) · (|RkNN(U)| + Σ_{X ∈ kNN(U)} |RkNN(X)|) / (K + 1)
where LSD(U, K) is the local spatial density of U, UCTI(U) is the uncertainty of U, kNN(U) is the set of K nearest neighbors of U, |RkNN(X)| is the number of reverse K-nearest neighbors of X, and |RkNN(U)| is the number of reverse K-nearest neighbors of U.
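A minimal sketch of the local spatial density and information degree, assuming a precomputed pairwise DTW distance matrix and the product form INFO = UCTI × LSD (the product form is a reconstruction; the patent's figure with the exact formula is not reproduced on this page):

```python
import numpy as np

def knn_indices(dist, i, k):
    """Indices of the k nearest neighbours of sample i (excluding i itself)."""
    order = np.argsort(dist[i])
    return [j for j in order if j != i][:k]

def rknn_size(dist, i, k):
    """|RkNN(i)|: number of samples that count i among their own k neighbours."""
    n = dist.shape[0]
    return sum(1 for j in range(n) if j != i and i in knn_indices(dist, j, k))

def lsd(dist, i, k):
    """Local spatial density: average reverse-k-NN count over {U} ∪ kNN(U)."""
    members = [i] + knn_indices(dist, i, k)
    return sum(rknn_size(dist, m, k) for m in members) / (k + 1)

def info(dist, i, k, ucti_i):
    """Information degree as uncertainty weighted by local density
    (product form assumed for illustration)."""
    return ucti_i * lsd(dist, i, k)
```

A high reverse-k-NN count means many samples regard U as a close neighbor, so outliers (which almost never appear in anyone's neighbor list) receive a low density and hence a low information degree even when their uncertainty is high.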
For a given partially labeled data set D, the representativeness of an unlabeled sample U can be expressed by the distribution difference between the labeled set L′ obtained after adding U and the whole set D. This difference is measured by the maximum mean discrepancy; because the samples are multivariate time series, the kernel of the maximum mean discrepancy is a global alignment kernel computed from a state-transition equation. For two time series X and Y of lengths m and n respectively, the state-transition equation is
M_{i,j} = κ(x_i, y_j) · (M_{i−1,j} + M_{i,j−1} + M_{i−1,j−1})
where κ is a local similarity between elements. Initially, M_{i,0} and M_{0,j} are all 0, and M_{1,1} is 1. The value M_{n,m} serves as the global alignment kernel K_GA(X, Y) between X and Y. Using this kernel as the kernel of the maximum mean discrepancy, the representativeness REP of sample U is
REP(U) = −MMD_{K_GA}(L′, D)
where L′ is the union of the labeled data set L and {U}, D is the whole data set, and K_GA denotes the global alignment kernel.
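The global-alignment recurrence and the MMD-based representativeness can be sketched as follows. The local element kernel κ is assumed Gaussian, and the sign convention (representativeness as the reduction in MMD when U joins the labeled set) is an assumption made for illustration; for long series the raw recurrence can underflow or overflow, so production code would work in log space.

```python
import numpy as np

def ga_kernel(x, y, sigma=1.0):
    """Global alignment kernel K_GA(X, Y) via the state-transition recurrence
    M[i,j] = k(x_i, y_j) * (M[i-1,j] + M[i,j-1] + M[i-1,j-1])."""
    n, m = len(x), len(y)
    M = np.zeros((n + 1, m + 1))
    M[0, 0] = 1.0  # initial state; borders M[i,0] and M[0,j] stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k_local = np.exp(-(x[i - 1] - y[j - 1]) ** 2 / (2 * sigma ** 2))
            M[i, j] = k_local * (M[i - 1, j] + M[i, j - 1] + M[i - 1, j - 1])
    return M[n, m]

def mmd_sq(A, B, kernel):
    """Squared maximum mean discrepancy between sample sets A and B."""
    kaa = np.mean([kernel(a, a2) for a in A for a2 in A])
    kbb = np.mean([kernel(b, b2) for b in B for b2 in B])
    kab = np.mean([kernel(a, b) for a in A for b in B])
    return kaa + kbb - 2 * kab

def rep(u, L, D, kernel=ga_kernel):
    """Representativeness of U: how much adding U to the labelled set L
    shrinks the distribution gap (MMD) to the whole data set D."""
    return mmd_sq(L, D, kernel) - mmd_sq(L + [u], D, kernel)
```

Under this convention a sample whose addition makes the labeled set look more like the full data set gets a larger representativeness score.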
Through the above process, the information degree and representativeness have been obtained for every sample in the unlabeled part of the WG data set.
Step 2: based on the information degree and representativeness calculated in step 1, obtain the most valuable unlabeled sample through the sampling algorithm based on a linear weighted sum.
based on the informativeness and the representation degree, the most valuable unlabeled samples are selected by two sampling strategies based on dual-target optimization, namely sampling based on linear weight sum and sampling based on indexes. Sampling based on linear weight sum converts a dual-target optimization problem into a single-target optimization problem, and selects an optimal sample after sequencing all unmarked samples, wherein a target equation is as follows:
the parameter α is used to balance the informativeness and the representational degree, info (X) is the informativeness of X, and rep (X) is the representational degree of X.
Through the above process, the most valuable unlabeled sample in the WG data set has been selected.
Step 3: label the unlabeled sample obtained in step 2, add the labeled sample to the labeled data set, assign its reverse nearest neighbors to the same class, and add them to the labeled data set as well.
Through the preceding steps, the most valuable unlabeled sample has been obtained; it is labeled and added to the labeled data set. To further enlarge the labeled data set, the reverse nearest neighbors U* of the sample in the unlabeled data set are computed, and a semi-supervised classifier assigns U* to the same category as U and adds them to the labeled data set. In the experiments, the semi-supervised classifier is a nearest-neighbor classifier.
In a specific implementation, practitioners can design the corresponding work flow themselves. For ease of reference, pseudo code of the suggested procedure is provided:
Algorithm 1: Framework of Active Semi-Supervised Learning
Input: multivariate time-series training data set D with partial labels PL; labeled data set L; unlabeled data set Ū
Output: updated labeled data set L
1. Do
2.   select the most valuable unlabeled sample U from Ū with the sampling algorithm of step 1;
3.   label U and add it to L;
4.   assign the reverse nearest neighbors U* of U to the same category as U and add them to L;
5. While (stopping criterion is not satisfied);
In the semi-supervised active learning process the symbols are: Algorithm 1 denotes Algorithm 1 of the invention, and "Framework of Active Semi-Supervised Learning" is its name; Input and Output are the input and output of Algorithm 1; PL denotes the partial labels; D the multivariate time-series training data set; L the labeled data set; Ū the unlabeled data set; U the unlabeled sample selected in step 1; and U* the reverse nearest neighbors of U.
The algorithm flow is as follows: until the stopping criterion is met, iterate the following process. First, select the most valuable unlabeled sample U with the sampling algorithm of step 1 (line 2); then label U and add it to the labeled data set L (line 3); then, through semi-supervised learning, automatically assign the reverse nearest neighbors of U to the same category as U, enlarging the labeled data set (line 4); finally, judge whether the stopping criterion is met and, once it is, obtain the updated labeled data set L.
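The loop of Algorithm 1 can be sketched as follows; `select`, `label_oracle`, `reverse_nn` and `stop` are hypothetical callables standing in for the sampling algorithm, the human annotator, the reverse-nearest-neighbor computation and the stopping criterion:

```python
def active_semi_supervised_learning(D, L, U, select, label_oracle,
                                    reverse_nn, stop):
    """Sketch of Algorithm 1: iterate select -> label -> propagate to
    reverse nearest neighbours until the stopping criterion holds.
    D is kept only for signature parity with the pseudo code."""
    utilities = []
    while True:
        u, score = select(U)                 # line 2: most valuable sample
        y = label_oracle(u)                  # line 3: query the annotator
        L.append((u, y))
        U.remove(u)
        for v in reverse_nn(u, U):           # line 4: semi-supervised expansion
            L.append((v, y))
            U.remove(v)
        utilities.append(score)
        if stop(utilities) or not U:         # line 5: stopping criterion
            return L
```

Tracking the utility score of each selected sample in `utilities` is what later lets the stopping criterion compare values across iterations.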
Step 4: evaluate the stopping criterion, judge whether the iteration should stop, and once the stopping condition is reached obtain the updated labeled data set.
Through the preceding steps, the most valuable unlabeled samples have been selected and labeled, enlarging the labeled data set. The algorithm then checks whether iteration should stop: when the utility values of the samples selected in several consecutive iterations differ only slightly, the remaining unlabeled samples are considered to contain no important samples that would further improve the classifier, and active learning reaches its stopping condition. For the semi-supervised active learning framework, the following stopping criterion is introduced:
max_{i,j ∈ {k−φ+1, …, k}} |Utility(S_i) − Utility(S_j)| < ε
where S_k is the k-th unlabeled sample selected during active learning, Utility(S_k) is the score of S_k computed by the sampling algorithms of step 1, ε is the stable amplitude, and φ is the stable interval. In the experiments, the stable amplitude ε was set to 0.001, and the stable interval was set to
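A sketch of the stopping test, assuming the criterion compares the utility scores of the last φ selected samples against the stable amplitude ε (the comparison over a sliding window is a reconstruction of the omitted formula):

```python
def stopping_criterion(utilities, phi=5, epsilon=0.001):
    """Stop when, over the last phi sampling rounds, the maximum pairwise
    difference of the selected samples' utility scores is below epsilon."""
    if len(utilities) < phi:
        return False  # not enough rounds yet to judge stability
    window = utilities[-phi:]
    return max(window) - min(window) < epsilon
```

Because the maximum pairwise difference within a window equals max(window) − min(window), the test needs only one pass over the last φ scores.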
In summary, the invention provides an active learning sampling method based on information degree and representativeness. To select samples for labeling effectively, the information degree and representativeness are first calculated for each multivariate time series, and a sampling method based on a linear weighted sum combines the two to select unlabeled samples, so that the number of unlabeled samples to be labeled is effectively reduced while accuracy is maintained.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.
Claims (10)
1. An active learning sampling method based on information degree and representativeness, characterized by comprising the following steps:
1) for each multivariate time series in the unlabeled data set, acquiring its information degree and representativeness;
2) based on the information degree and representativeness calculated in step 1), obtaining the most valuable unlabeled sample through a sampling algorithm based on a linear weighted sum or a sampling algorithm based on indexes;
3) labeling the unlabeled sample obtained in step 2) and adding the labeled sample to the labeled data set;
4) judging whether the stopping criterion is satisfied and, once it is, obtaining the updated labeled data set.
2. The active learning sampling method based on information degree and representativeness according to claim 1, wherein in step 1) the information degree and representativeness of each time series are obtained as follows:
calculate, for each time series, its uncertainty, local spatial density, maximum mean discrepancy and global alignment kernel;
calculate the information degree from the uncertainty and the local spatial density, and the representativeness from the maximum mean discrepancy and the global alignment kernel.
3. The active learning sampling method based on information degree and representativeness according to claim 2, wherein in step 1) the information degree is calculated from the uncertainty and the local spatial density, specifically as follows:
1.1) for an unlabeled multivariate time-series sample U, find its nearest-neighbor positive sample U_p and nearest-neighbor negative sample U_n using DTW-based similarity; from the DTW similarities of U to U_p and U_n, calculate the two parameters P_U and N_U, and then calculate the uncertainty UCTI of sample U with the information-entropy formula:
UCTI(U) = -(P_U log P_U + N_U log N_U)
wherein
P_U = DSim(U, U_p) / (DSim(U, U_p) + DSim(U, U_n)),  N_U = DSim(U, U_n) / (DSim(U, U_p) + DSim(U, U_n))
and U_p is the nearest positive neighbor of U, U_n is the nearest negative neighbor of U, DSim(U, U_p) is the DTW-based similarity of U and U_p, and DSim(U, U_n) is the DTW-based similarity of U and U_n;
1.2) calculate the k nearest neighbors of sample U over the whole data set, count the reverse k-nearest neighbors of each point in the union of U and its k nearest neighbors, and average these counts to obtain the local spatial density of U;
1.3) calculate the information degree of sample U from the uncertainty and the local spatial density.
4. The active learning sampling method based on information degree and representativeness according to claim 3, wherein the local spatial density LSD in step 1.2) is obtained by averaging the reverse k-nearest-neighbor counts of the points in the union of sample U and its k nearest neighbors, calculated as:
LSD(U, K) = (1 / (K + 1)) · Σ_{X ∈ {U} ∪ kNN(U)} |RkNN(X)|
where K is the number of nearest neighbors, kNN(U) is the set of K nearest neighbors of U, RkNN(X) is the set of reverse K-nearest neighbors of X, and |RkNN(X)| is its size.
5. The active learning sampling method based on information degree and representativeness according to claim 3, wherein the information degree INFO of the unlabeled multivariate time series U in step 1.3) is calculated with the following formula:
INFO(U) = UCTI(U) · LSD(U, K) = UCTI(U) · (|RkNN(U)| + Σ_{X ∈ kNN(U)} |RkNN(X)|) / (K + 1)
where LSD(U, K) is the local spatial density of U, UCTI(U) is the uncertainty of U, kNN(U) is the set of K nearest neighbors of U, |RkNN(X)| is the number of reverse K-nearest neighbors of X, and |RkNN(U)| is the number of reverse K-nearest neighbors of U.
6. The active learning sampling method based on informativeness and representativeness according to claim 2, wherein the representativeness of an unlabeled multivariate time-series sample U in step 1) is calculated as follows: a global-distribution kernel function between two time series is constructed from their state-transition equations and initial states, and this kernel is then used as the kernel of the maximum mean discrepancy (MMD) to compute the representativeness of sample U.
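The MMD-based representativeness of claim 6 can be sketched as follows. The patent's global-distribution kernel (built from state-transition equations of the time series) is not specified in the extracted text, so an RBF kernel on flattened series is substituted here purely for illustration; only the MMD plumbing is the point:

```python
import numpy as np

def representativeness(cand: np.ndarray, pool: np.ndarray, gamma: float = 1.0) -> float:
    """Claim 6, sketched: representativeness as negated squared MMD between
    a candidate sample and the unlabeled pool.

    Assumption: an RBF kernel stands in for the patent's global-distribution
    kernel. `cand` is a flattened series of shape (d,), `pool` is (n, d)."""
    def rbf(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * d2)

    x = cand[None, :]
    # Squared MMD between the singleton {cand} and the pool distribution.
    mmd2 = rbf(x, x).mean() - 2.0 * rbf(x, pool).mean() + rbf(pool, pool).mean()
    return -mmd2  # higher = candidate better matches the pool distribution
```

A candidate near the bulk of the pool yields a small MMD and thus a high representativeness; a candidate far from the pool is penalized, steering sampling away from outliers.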
7. The active learning sampling method based on informativeness and representativeness according to claim 1, wherein the sampling algorithm in step 2) is a sampling algorithm based on a linear weighted sum, specifically: the ranking of samples by informativeness and representativeness is converted from a bi-objective optimization problem into a single-objective one; a parameter α is introduced to balance informativeness against representativeness, all unlabeled samples are ranked, and the optimal sample is selected; the objective equation is as follows:
where the parameter α trades off informativeness against representativeness, INFO(X) is the informativeness of X, and REP(X) is the representativeness of X.
8. The active learning sampling method based on informativeness and representativeness of claim 1, wherein the sampling algorithm in step 2) is an index-based sampling algorithm, specifically: two comparative indexes are calculated; a random-ordering technique is adopted to balance the search bias between the two indexes, and a parameter β is introduced to control their relative weight. The two comparative indexes I1(X) and I2(X) are calculated as follows:
where "Y precedes X" denotes that Y is ranked ahead of X in the ordering.
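The index-based scheme of claim 8 can be sketched as follows. The formulas for I1 and I2 were images lost from the source, so the interpretation below (each index counts how many samples are ranked ahead of X under one criterion, with a random shuffle removing the bias a fixed iteration order would give tied samples) is an assumption:

```python
import random

def index_based_pick(samples, info, rep, beta=0.5, seed=0):
    """Claim 8, sketched. Assumed reading: I1(X)/I2(X) count the samples
    ranked ahead of X by informativeness / representativeness; beta blends
    the two indexes, and a random shuffle breaks ties without bias."""
    rng = random.Random(seed)
    order = list(samples)
    rng.shuffle(order)  # random ordering to balance search bias

    def i1(x):  # samples strictly more informative than x
        return sum(info[y] > info[x] for y in order)

    def i2(x):  # samples strictly more representative than x
        return sum(rep[y] > rep[x] for y in order)

    # Lower blended index = earlier in the combined ranking.
    return min(order, key=lambda x: beta * i1(x) + (1 - beta) * i2(x))
```

Unlike the weighted sum of claim 7, this blends *ranks* rather than raw scores, so it is insensitive to the two criteria being on different numeric scales.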
9. The active learning sampling method based on informativeness and representativeness of claim 1, wherein step 3) further comprises assigning the reverse nearest neighbors of the unlabeled sample selected in step 2) to the same class as that sample, and adding them to the labeled data set after labeling.
10. The active learning sampling method based on informativeness and representativeness of claim 1, wherein the stopping criterion in step 4) is: set a stability interval φ; if, within the stability interval φ, the maximum difference between the values of the unlabeled samples selected in successive sampling iterations is less than a threshold, the stopping criterion is satisfied and the iteration stops; the value of an unlabeled sample is computed by the sampling algorithm of step 2).
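The stability-interval stopping criterion of claim 10 reduces to a check on the spread of recent selection scores. A minimal sketch, assuming `value_history` holds the score of the sample selected at each iteration:

```python
def should_stop(value_history, phi=5, eps=1e-3):
    """Claim 10, sketched: stop once the values of the samples selected over
    the last `phi` iterations have stabilized, i.e. their spread (maximum
    pairwise difference) falls below the threshold `eps`."""
    if len(value_history) < phi:
        return False  # not enough iterations to fill the stability interval
    window = value_history[-phi:]
    return max(window) - min(window) < eps
```

Intuitively, once the best remaining unlabeled samples all score about the same, further queries add little, so the active-learning loop terminates instead of exhausting the labeling budget.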
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011296097.7A CN112418293A (en) | 2020-11-18 | 2020-11-18 | Active learning sampling method based on information degree and representativeness |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112418293A true CN112418293A (en) | 2021-02-26 |
Family
ID=74774846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011296097.7A Pending CN112418293A (en) | 2020-11-18 | 2020-11-18 | Active learning sampling method based on information degree and representativeness |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112418293A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118429625A (en) * | 2024-07-05 | 2024-08-02 | 湖南大学 | Kitchen waste target detection method based on active learning selection strategy |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104318242A (en) * | 2014-10-08 | 2015-01-28 | 中国人民解放军空军工程大学 | High-efficiency SVM active half-supervision learning algorithm |
CN106991444A (en) * | 2017-03-31 | 2017-07-28 | 西南石油大学 | The Active Learning Method clustered based on peak density |
CN111563590A (en) * | 2020-04-30 | 2020-08-21 | 华南理工大学 | Active learning method based on generation countermeasure model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2021-02-26