
CN112418293A - Active learning sampling method based on information degree and representativeness - Google Patents

Active learning sampling method based on information degree and representativeness

Info

Publication number
CN112418293A
CN112418293A
Authority
CN
China
Prior art keywords
degree
sample
sampling
active learning
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011296097.7A
Other languages
Chinese (zh)
Inventor
何国良 (He Guoliang)
王晗 (Wang Han)
黄成瑞 (Huang Chengrui)
陈仪榕 (Chen Yirong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202011296097.7A
Publication of CN112418293A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an active learning sampling method based on informativeness and representativeness, comprising the following steps: 1) for each multivariate time series in the unlabeled dataset, compute its informativeness and representativeness; 2) based on the informativeness and representativeness computed in step 1), obtain the most valuable unlabeled sample through a sampling algorithm; 3) label the unlabeled samples obtained in step 2) and add them to the labeled dataset; 4) judge whether the stopping criterion is met, and once it is, obtain the updated labeled dataset. For the active learning problem on multivariate time series, the invention provides an effective sampling algorithm for extracting unlabeled time series samples: a bi-objective sampling algorithm combines informativeness and representativeness, so that the number of unlabeled samples that must be labeled can be effectively reduced while accuracy is maintained.

Description

Active learning sampling method based on information degree and representativeness
Technical Field
The invention relates to a data mining technology, in particular to an active learning sampling method based on information degree and representativeness.
Background
The large scale and diversity of annotation training data is crucial for the training of high quality classification models. However, in practical applications, the tagged data is usually very small and the untagged data is very large, so that it is time-consuming and expensive to tag all the data. In order to save human resources and cost, active learning becomes a research hotspot. The sampling of unlabeled samples is a key link in active learning, and for this reason, many algorithms have been proposed by scholars at home and abroad.
In order to obtain high quality labeled training data, a sampling strategy of active learning is to seek the most valuable samples in unlabeled data, where uncertainty-based sampling is a common approach. Lughofer et al, in conjunction with the generalized Takagi-Sugeno fuzzy model, propose two uncertainty-based sampling criteria. Density-based sampling methods have been investigated for the problem of potentially selecting outliers based on uncertain sampling. Mohamad et al teach a density-based criterion for the processing of dynamic data streams to correct for sampling deviations by weighting the samples to reflect true latent distributions.
To further improve the quality of labeled training data, some research has been directed at improving the diversity of the training data. To reflect the diversity involved in multi-instance active learning, Wang et al. proposed two diversity criteria, one based on clustering and one based on fuzzy rough sets. Other studies introduced composite strategies to improve sampling quality; He et al. proposed a sampling strategy based on uncertainty and local data density ordering. Seeking a general method for selecting the most suitable samples, Du et al. proposed a measure combining informativeness and representativeness.
The studies above mainly address the sampling problem of active learning; however, existing methods have not performed well on high-dimensional data, especially on multivariate time series. Time series data are widely used in fields such as medicine, industry, commerce and military affairs, and multivariate time series data have high dimensionality, so active learning for multivariate time series is of high practical significance and value.
Disclosure of Invention
The invention aims to solve the technical problem of providing an active learning sampling method based on information degree and representativeness aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: an active learning sampling method based on information degree and representativeness comprises the following steps:
1) acquiring the information degree and the representation degree of each time sequence for the multivariate time sequence in the unlabeled data set;
2) based on the information degree and the representation degree obtained by calculation in the step 1), obtaining the most valuable unmarked sample through a sampling algorithm based on linear weight sum or a sampling algorithm based on indexes;
3) labeling the unlabeled samples obtained by sampling in the step 2), and adding the labeled samples into a labeled data set;
4) judging whether the stopping criterion is met, and obtaining an updated labeled dataset once it is met.
According to the scheme, in the step 1), the information degree and the representativeness of each time sequence are obtained as follows:
respectively calculating the uncertainty, the local space density, the maximum mean difference and the global distribution kernel function of each time sequence;
and calculating the information degree according to the uncertainty and the local space density, and calculating the representation degree according to the maximum mean difference and the global distribution kernel function.
According to the scheme, the information degree is calculated according to the uncertainty and the local space density in the step 1), and the method specifically comprises the following steps:
1.1) For an unlabeled multivariate time series sample U, find its nearest-neighbor positive sample U_p and nearest-neighbor negative sample U_n based on dynamic time warping similarity. From the similarities between U and U_p and between U and U_n, compute two parameters P_U and N_U, and from P_U, N_U and the information entropy formula compute the uncertainty UCTI of sample U:
UCTI(U) = -(P_U log P_U + N_U log N_U)
where
P_U = DSim(U, U_p) / (DSim(U, U_p) + DSim(U, U_n))
N_U = DSim(U, U_n) / (DSim(U, U_p) + DSim(U, U_n))
Here U_p is the nearest-neighbor positive sample of U, U_n is the nearest-neighbor negative sample of U, DSim(U, U_p) is the dynamic time warping similarity between U and U_p, and DSim(U, U_n) is the dynamic time warping similarity between U and U_n.
1.2) calculating k neighbors of a sample U on the whole data set, then calculating the quantity of inverse k neighbors of each point in a union set formed by the sample U and the k neighbors, and calculating the average value to obtain the local space density of the sample U;
1.3) calculating the information degree of the sample U based on the uncertainty and the local space density.
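Step 1.1 above can be sketched in Python. The text does not specify the DTW implementation or how a DTW distance becomes the similarity DSim, so the plain DTW recursion and the 1/(1+distance) conversion below are assumptions:

```python
import math

def dtw_distance(x, y):
    # Classic dynamic time warping (DTW) distance between two 1-D sequences.
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def dsim(x, y):
    # Hypothetical distance-to-similarity conversion (not from the patent).
    return 1.0 / (1.0 + dtw_distance(x, y))

def uncertainty(u, u_pos, u_neg):
    # Step 1.1: entropy of the similarity-based estimates P_U and N_U
    # computed from the nearest positive and negative neighbours of u.
    sp, sn = dsim(u, u_pos), dsim(u, u_neg)
    p_u, n_u = sp / (sp + sn), sn / (sp + sn)
    return -(p_u * math.log(p_u) + n_u * math.log(n_u))
```

A sample equally similar to its nearest positive and negative neighbours has maximal uncertainty log 2; one clearly closer to either side has lower uncertainty.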
According to the scheme, the local spatial density LSD in step 1.2) is obtained by averaging the number of reverse k-nearest neighbors of each point in the union of sample U and its k nearest neighbors:
LSD(U, K) = (1 / (K + 1)) Σ_{X ∈ {U} ∪ kNN(U)} |RkNN(X)|
where K is the number of nearest neighbors, kNN(U) is the set of k nearest neighbors of U, RkNN(X) is the set of reverse k nearest neighbors of X, and |RkNN(X)| is its size.
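The local spatial density above, the average reverse-k-NN count over the sample and its k nearest neighbours, can be sketched from a pairwise distance matrix; the brute-force neighbour search is an illustrative assumption, not the patent's implementation:

```python
def knn(idx, dist, k):
    # Indices of the k nearest neighbours of point `idx` under the
    # pairwise distance matrix `dist` (the point itself is excluded).
    order = sorted((j for j in range(len(dist)) if j != idx),
                   key=lambda j: dist[idx][j])
    return order[:k]

def rknn_size(idx, dist, k):
    # |RkNN(idx)|: how many other points count `idx` among their k neighbours.
    return sum(1 for j in range(len(dist))
               if j != idx and idx in knn(j, dist, k))

def lsd(idx, dist, k):
    # LSD(U, K): average reverse-k-NN count over the union of the sample
    # and its k nearest neighbours (K + 1 points in total).
    pts = [idx] + knn(idx, dist, k)
    return sum(rknn_size(p, dist, k) for p in pts) / (k + 1)
```

Points in dense regions accumulate many reverse neighbours and thus a high LSD, which is what lets the informativeness measure favour samples from dense areas.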
According to the scheme, the informativeness INFO of the unlabeled multivariate time series U in step 1.3) combines the uncertainty with the local spatial density:
INFO(U) = UCTI(U) × LSD(U, K)
where LSD(U, K) is the local spatial density of U, UCTI(U) is the uncertainty of U, kNN(U) is the set of k nearest neighbors of U, |RkNN(X)| is the number of reverse k nearest neighbors of X, and |RkNN(U)| is the number of reverse k nearest neighbors of U.
According to the scheme, the representativeness of the unlabeled multivariate time series sample U in step 1) is calculated by computing a global alignment kernel between two time series from their state transition equation and initial state, and using this kernel as the kernel of the maximum mean discrepancy to obtain the representativeness of sample U.
According to the scheme, the sampling algorithm in step 2) is a sampling algorithm based on a linear weighted sum, as follows: the sample ranking based on informativeness and representativeness is converted from a bi-objective optimization problem into a single-objective one, a parameter α is introduced to balance informativeness against representativeness, and all unlabeled samples are ranked so that the best sample can be selected. The objective equation is:
X* = argmax_X [ α · INFO(X) + (1 − α) · REP(X) ]
where the parameter α weighs informativeness against representativeness, INFO(X) is the informativeness of X, and REP(X) is the representativeness of X.
According to the scheme, the sampling algorithm in step 2) is an index-based sampling algorithm, as follows: compute two comparative indexes, use a random ordering technique to balance the search bias between the different indexes, and introduce a parameter β to control the proportion of the two indexes. The two comparative indexes I1(X) and I2(X) are computed as follows:
[The formulas for I1(X) and I2(X) are rendered as images in the source.]
where "Y prefixes X" indicates that Y precedes X in the ordering.
According to the scheme, step 3) further comprises assigning the reverse nearest neighbor samples of the unlabeled sample obtained in step 2) to the same class as that sample, labeling them, and adding them to the labeled dataset.
According to the scheme, the stopping criterion in step 4) is: set a stability interval Φ; if, within Φ, the maximum difference between the utility values of the unlabeled samples selected in successive sampling iterations is smaller than a threshold, the stopping criterion is met and iteration stops. The utility value of an unlabeled sample is computed by the sampling algorithm in step 2).
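The stopping rule described above, utilities of recently drawn samples stabilising within a window Φ to within a threshold, can be sketched as follows (a minimal illustration, not the patent's implementation):

```python
def should_stop(utilities, phi, eps):
    # utilities: Utility(S_k) of the samples drawn so far, in draw order.
    # Stop once the largest pairwise utility gap over the last `phi`
    # draws falls below the stability amplitude `eps`.
    if len(utilities) < phi:
        return False
    window = utilities[-phi:]
    return max(window) - min(window) < eps
```

With phi = 3 and eps = 0.001, a run whose last three utilities differ by less than 0.001 stops, while a still-moving run does not.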
The invention has the following beneficial effects:
the invention provides an effective sampling algorithm for extracting unlabeled time sequence samples aiming at the active learning problem of a multivariate time sequence, two sampling standards for measuring the importance of unlabeled samples from different angles are provided by combining a double-optimization sampling algorithm with information degree and representativeness for sampling, and the information degree and the representativeness of the samples are respectively measured, so that the sampling number of the unlabeled samples can be effectively reduced under the condition of ensuring the accuracy.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic structural diagram of an embodiment of the present invention;
FIG. 2 is a diagram illustrating the accuracy of nearest neighbor classification results after a sampling algorithm based on linear weight sum extracts unlabeled data of different proportions for labeling;
FIG. 3 is a schematic diagram of F-measure values of nearest neighbor classification results after sampling algorithms based on linear weight sum extract unlabeled data of different proportions for labeling;
FIG. 4 is a schematic diagram of the maximum mean discrepancy between the labeled dataset and the whole dataset after extracting different proportions of unlabeled samples from the WG dataset for labeling, with and without considering sample representativeness;
fig. 5 is a schematic diagram of two-dimensional visualization distribution of a labeled data set and an entire data set after a linear weight sum-based sampling algorithm for extracting unlabeled samples in different proportions from a WG data set, wherein the informativeness and the representational degree are considered simultaneously.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the present invention uses the WG dataset as a concrete example to illustrate the effectiveness of the method. The WG data contain 2 categories (labeled positive and negative respectively); each sample contains 3 variables, that is, time series of 3 variables, each of length 315, and there are 1120 samples in total. In the initial state only one sample is labeled, and the remaining data are unlabeled. To reduce sensitivity to initialization, ten tests were carried out, each with different labeled data as the initial state.
As shown in fig. 1, an active learning sampling method based on information degree and representativeness includes the following steps:
step 1, calculating uncertainty, local space density, maximum mean difference and global distribution kernel function of each time sequence for a plurality of time sequences in an unlabeled data set; and calculating the information degree based on the uncertainty and the local space density, and calculating the representation degree based on the maximum mean difference and the global distribution kernel function. The time series in the invention can be medical data, industrial sensor data, server system monitoring data and the like. If the time sequence is medical data, the method can effectively reduce the sampling number of unlabeled samples under the condition of ensuring the accuracy, and selects the sample with higher research value, thereby being beneficial to disease diagnosis and medical research.
The embodiment computes the informativeness and representativeness for each sample in the unlabeled dataset. The informativeness computation considers both the uncertainty of a sample and its local data density. For an unlabeled multivariate time series sample U, the uncertainty computation first finds, based on dynamic time warping similarity, the nearest-neighbor positive sample U_p and nearest-neighbor negative sample U_n of U; from the similarities between U and U_p and between U and U_n it computes two parameters P_U and N_U, and from P_U, N_U and the information entropy formula it computes the uncertainty UCTI of sample U as follows:
UCTI(U) = -(P_U log P_U + N_U log N_U)
P_U = DSim(U, U_p) / (DSim(U, U_p) + DSim(U, U_n))
N_U = DSim(U, U_n) / (DSim(U, U_p) + DSim(U, U_n))
where U_p is the nearest-neighbor positive sample of U, U_n is the nearest-neighbor negative sample of U, DSim(U, U_p) is the dynamic time warping similarity between U and U_p, and DSim(U, U_n) is the dynamic time warping similarity between U and U_n.
The local data density LSD is obtained by averaging the number of reverse k-nearest neighbors of each point in the union of sample U and its k nearest neighbors:
LSD(U, K) = (1 / (K + 1)) Σ_{X ∈ {U} ∪ kNN(U)} |RkNN(X)|
where K is the number of nearest neighbors, kNN(U) is the set of k nearest neighbors of U, RkNN(X) is the set of reverse k nearest neighbors of X, and |RkNN(X)| is its size.
Based on the uncertainty and the local spatial density, the informativeness INFO of an unlabeled multivariate time series U is calculated as:
INFO(U) = UCTI(U) × LSD(U, K)
where LSD(U, K) is the local spatial density of U, UCTI(U) is the uncertainty of U, kNN(U) is the set of k nearest neighbors of U, |RkNN(X)| is the number of reverse k nearest neighbors of X, and |RkNN(U)| is the number of reverse k nearest neighbors of U.
For a given partially labeled dataset D, the representativeness of an unlabeled sample U can be expressed by the distribution difference between D and the labeled dataset L' obtained after adding U. The maximum mean discrepancy is used to measure the distribution difference between the two datasets and, to suit the characteristics of multivariate time series, the global alignment kernel between two time series is computed from their state transition equation and initial state. For two time series X and Y of lengths m and n respectively, the state transition equation is:
[State transition recursion for M_{i,j}, rendered as an image in the source.]
Initially, M_{i,0} and M_{0,j} are all 0, and M_{1,1} is 1. The value M_{n,m} is taken as the global alignment kernel K_GA(X, Y) between X and Y. Using the global alignment kernel as the kernel of the maximum mean discrepancy, the representativeness REP of sample U is computed as:
[REP formula based on the maximum mean discrepancy between L' and D, rendered as an image in the source.]
where L' is the union of the labeled dataset L and {U}, D is the whole dataset, and K_GA denotes the global alignment kernel.
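The global alignment kernel can be sketched with the standard dynamic-programming recursion. The patent's exact recursion, local kernel and boundary values are rendered as images in the source, so the Gaussian local kernel and the M[0][0] = 1 initialisation below are assumptions:

```python
import math

def ga_kernel(x, y, sigma=1.0):
    # Global alignment kernel between two 1-D series via the standard
    # dynamic-programming recursion: each cell accumulates the three
    # predecessor cells, weighted by a local Gaussian kernel (assumed).
    n, m = len(x), len(y)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.exp(-((x[i - 1] - y[j - 1]) ** 2) / (2 * sigma ** 2))
            M[i][j] = local * (M[i - 1][j - 1] + M[i - 1][j] + M[i][j - 1])
    return M[n][m]
```

For identical one-point series the kernel value is 1, and it decays as the series move apart, which is the behaviour the representativeness measure relies on.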
Through the above process, the informativeness and representativeness have been obtained for each sample in the WG unlabeled dataset.
Step 2: based on the informativeness and representativeness computed in step 1, obtain the most valuable unlabeled sample through the sampling algorithm based on the linear weighted sum.
Given the informativeness and representativeness, the most valuable unlabeled samples can be selected with either of two sampling strategies based on bi-objective optimization, namely sampling based on a linear weighted sum and index-based sampling. Sampling based on a linear weighted sum converts the bi-objective optimization problem into a single-objective one and selects the best sample after ranking all unlabeled samples; the objective equation is:
X* = argmax_X [ α · INFO(X) + (1 − α) · REP(X) ]
where the parameter α balances informativeness against representativeness, INFO(X) is the informativeness of X, and REP(X) is the representativeness of X.
Through the above process, the most valuable unlabeled samples among the unlabeled samples in the WG dataset have been selected.
Step 3: label the unlabeled sample obtained in step 2, add the labeled sample to the labeled dataset, assign its reverse nearest neighbor samples to the same class, and add them to the labeled dataset as well.
Through the steps above, the most valuable unlabeled sample has been obtained; it is then labeled and added to the labeled dataset L. To further enlarge the labeled dataset, the reverse nearest neighbor samples U* of the sample are recomputed in the unlabeled dataset Ū, and a semi-supervised classifier assigns U* to the same class as U and adds them to the labeled dataset. The semi-supervised classifier in the experiment is a nearest neighbor classifier.
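The reverse-nearest-neighbour labelling step can be sketched as follows, with a brute-force 1-NN search over a pairwise distance matrix standing in for the nearest-neighbour semi-supervised classifier mentioned in the text:

```python
def reverse_1nn(idx, dist):
    # Indices whose single nearest neighbour is `idx` (the R1NN of idx),
    # under the pairwise distance matrix `dist`.
    out = []
    for j in range(len(dist)):
        if j == idx:
            continue
        nn = min((k for k in range(len(dist)) if k != j),
                 key=lambda k: dist[j][k])
        if nn == idx:
            out.append(j)
    return out

def propagate_label(idx, label, labels, dist):
    # Give the freshly labeled sample's label to its still-unlabeled
    # R1NN samples, mimicking the semi-supervised step in the text.
    labels[idx] = label
    for j in reverse_1nn(idx, dist):
        if labels[j] is None:
            labels[j] = label
    return labels
```

On three points at positions 0, 0.1 and 5, labelling point 0 also labels point 1 (whose nearest neighbour is point 0) but not point 2.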
In a specific implementation, practitioners can design the corresponding workflow themselves. For ease of reference, pseudo code of the proposed procedure is provided:
Algorithm 1: Framework of Active Semi-Supervised Learning
Input: a PL dataset D with two subsets L (labeled) and Ū (unlabeled)
Output: the labeled dataset L
1. Do
2. Select the most important unlabeled example U from Ū using the proposed sampling strategy;
3. Label the unlabeled sample U and add U into L;
4. Automatically classify the R1NN sample U* of U with semi-supervised learning and add U* into L;
5. While (stopping criterion is not satisfied);
6. Return the labeled dataset L
The symbols in the semi-supervised active learning process are as follows: Algorithm 1 denotes the first algorithm of the invention, and Framework of Active Semi-Supervised Learning is its name; Input and Output denote the input and output of Algorithm 1; PL denotes a partially labeled dataset; D denotes the multivariate time series training dataset; L denotes the labeled dataset; Ū denotes the unlabeled dataset; U denotes the unlabeled sample selected in step 1; U* denotes the reverse nearest neighbor sample of U.
The algorithm flow is as follows. Until the stopping criterion is met, the following process is iterated: first, select the most valuable unlabeled sample U with the sampling algorithm of step 1 (line 2); then label U and add it to the labeled dataset L (line 3); then automatically assign the reverse nearest neighbors of U to the same class as U through semi-supervised learning, enlarging the labeled dataset (line 4); finally, judge whether the stopping criterion is met, and once it is, obtain the updated labeled dataset L.
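Putting Algorithm 1 together, the outer loop can be sketched as below; `score` stands in for the proposed sampling strategy, `oracle` for the human annotator, and the R1NN propagation step is omitted for brevity, all of which are simplifying assumptions:

```python
def active_learning_loop(unlabeled, score, oracle, phi=3, eps=1e-3):
    # Skeleton of the active semi-supervised loop: repeatedly pick the
    # highest-scoring unlabeled sample, query its label, and stop once
    # the utilities of the last `phi` picks have stabilised within `eps`.
    labeled, utilities = {}, []
    while unlabeled:
        u = max(unlabeled, key=score)
        utilities.append(score(u))
        labeled[u] = oracle(u)
        unlabeled.remove(u)
        window = utilities[-phi:]
        if len(utilities) >= phi and max(window) - min(window) < eps:
            break
    return labeled
```

With a constant scoring function the utilities stabilise immediately, so the loop labels exactly `phi` samples before stopping.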
Step 4: compute the stopping criterion, judge whether the algorithm has reached the iteration stopping condition, and once it has, obtain the updated labeled dataset.
Through the steps above, the most valuable unlabeled sample has been selected and labeled, and the labeled dataset has been enlarged. The algorithm then judges whether iteration should stop: when the utility differences among the samples selected over several iterations are small, the remaining unlabeled samples are considered to contain no more important samples and would not help improve classifier performance, so active learning reaches the stopping condition. For the semi-supervised active learning framework, the following stopping criterion is introduced:
max | Utility(S_i) − Utility(S_j) | < ε, over all samples S_i, S_j selected within the stability interval Φ
where S_k is the k-th unlabeled sample extracted in the active learning process, Utility(S_k) is the score of S_k computed by the two sampling algorithms in step 1, the parameter ε is the stability amplitude, and Φ is the stability interval. In the experiment, the stability amplitude ε was set to 0.001; the stability interval value is rendered as an image in the source.
In summary, the invention provides an active learning sampling method based on informativeness and representativeness. To extract samples for labeling effectively, the informativeness and representativeness are first computed for each multivariate time series, and a sampling method based on a linear weighted sum is introduced to select unlabeled samples by combining the two criteria, so that the number of unlabeled samples that must be labeled can be effectively reduced while accuracy is maintained.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (10)

1. An active learning sampling method based on information degree and representativeness is characterized by comprising the following steps:
1) acquiring the information degree and the representation degree of each time sequence for the multivariate time sequence in the unlabeled data set;
2) based on the information degree and the representation degree obtained by calculation in the step 1), obtaining the most valuable unmarked sample through a sampling algorithm based on linear weight sum or a sampling algorithm based on indexes;
3) labeling the unlabeled samples obtained by sampling in the step 2), and adding the labeled samples into a labeled data set;
4) judging whether the stopping criterion is met, and obtaining an updated labeled dataset once it is met.
2. The active learning sampling method based on the information degree and the representation degree according to claim 1, wherein in the step 1), the information degree and the representation degree of each time series are obtained as follows:
respectively calculating the uncertainty, the local space density, the maximum mean difference and the global distribution kernel function of each time sequence;
and calculating the information degree according to the uncertainty and the local space density, and calculating the representation degree according to the maximum mean difference and the global distribution kernel function.
3. The active learning sampling method based on informativeness and representational degree according to claim 2, characterized in that, in the step 1), the informativeness is calculated according to uncertainty and local spatial density, specifically as follows:
1.1) for an unlabeled multivariate time series sample U, finding its nearest-neighbor positive sample U_p and nearest-neighbor negative sample U_n based on dynamic time warping similarity; computing two parameters P_U and N_U from the similarities between U and U_p and between U and U_n; and computing the uncertainty UCTI of sample U from P_U, N_U and the information entropy formula:
UCTI(U) = -(P_U log P_U + N_U log N_U)
wherein
P_U = DSim(U, U_p) / (DSim(U, U_p) + DSim(U, U_n))
N_U = DSim(U, U_n) / (DSim(U, U_p) + DSim(U, U_n))
where U_p is the nearest-neighbor positive sample of U, U_n is the nearest-neighbor negative sample of U, DSim(U, U_p) is the dynamic time warping similarity between U and U_p, and DSim(U, U_n) is the dynamic time warping similarity between U and U_n;
1.2) calculating k neighbors of a sample U on the whole data set, then calculating the quantity of inverse k neighbors of each point in a union set formed by the sample U and the k neighbors, and calculating the average value to obtain the local space density of the sample U;
1.3) calculating the information degree of the sample U based on the uncertainty and the local space density.
4. The active learning sampling method based on informativeness and representativeness according to claim 3, wherein the local spatial density LSD in step 1.2) is obtained by averaging the number of reverse k-nearest neighbors of each point in the union of sample U and its k nearest neighbors:
LSD(U, K) = (1 / (K + 1)) Σ_{X ∈ {U} ∪ kNN(U)} |RkNN(X)|
where K is the number of nearest neighbors, kNN(U) is the set of k nearest neighbors of U, RkNN(X) is the set of reverse k nearest neighbors of X, and |RkNN(X)| is its size.
5. The active learning sampling method based on the informativeness and the representational degree according to claim 3, characterized in that the informativeness INFO calculation of the unabelonged multivariate time series U in step 1.3) adopts the following formula:
[The formula appears only as an image in the original filing; it combines UCTI(U) with LSD(U, K).]
where LSD (U, K) is the local spatial density of U, ucti (U) is the uncertainty of U, knn (U) is the K neighbors of U, | rknn (X) | is the number of inverse K neighbors of X, | rknn (U) | is the number of inverse K neighbors of U.
6. The active learning sampling method based on informativeness and representativeness according to claim 2, wherein the representativeness of the unlabeled multivariate time series sample U in step 1) is calculated as follows: a global distribution kernel function between two time series is computed from their state transition equations and initial states, and the representativeness of the sample U is then obtained by using this global distribution kernel as the kernel function of the maximum mean discrepancy (MMD).
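A minimal sketch of the MMD-based representativeness of claim 6. The global distribution kernel itself is not given in this excerpt, so a Gaussian kernel on scalar stand-ins replaces it, and the objective (how much adding the candidate to the labeled pool narrows the gap to the unlabeled pool) is an assumed reading, not the claim's exact formula:

```python
import math

def mmd_sq(A, B, kernel):
    # squared maximum mean discrepancy between sample sets A and B
    def avg(X, Y):
        return sum(kernel(x, y) for x in X for y in Y) / (len(X) * len(Y))
    return avg(A, A) + avg(B, B) - 2.0 * avg(A, B)

def gaussian(x, y, gamma=1.0):
    # stand-in for the patent's global distribution kernel
    return math.exp(-gamma * (x - y) ** 2)

def representativeness(u, labeled, unlabeled):
    # assumed reading: u is representative if adding it to the labeled pool
    # brings that pool's kernel mean embedding closer to the unlabeled pool
    return -mmd_sq(labeled + [u], unlabeled, gaussian)
```

Under this reading, a candidate sitting inside the unlabeled mass scores higher than a distant outlier, which is exactly the behavior a representativeness term is meant to enforce.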
7. The active learning sampling method based on informativeness and representativeness according to claim 1, wherein the sampling algorithm in step 2) is a sampling algorithm based on a linear weighted sum, specifically as follows: the sample ranking based on informativeness and representativeness is converted from a bi-objective optimization problem into a single-objective optimization problem; a parameter α is introduced to balance informativeness and representativeness, all unlabeled samples are ranked, and the optimal sample is selected; the objective equation is:
X* = argmax_X [α · INFO(X) + (1 − α) · REP(X)], X ranging over the unlabeled samples
The parameter α balances informativeness against representativeness; INFO(X) is the informativeness of X and REP(X) is the representativeness of X.
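The single-objective selection of claim 7 reduces to an argmax over the unlabeled pool. A minimal sketch, with the score dictionaries `info` and `rep` assumed to be precomputed by the earlier steps:

```python
def select_sample(unlabeled, info, rep, alpha):
    # argmax over X of alpha * INFO(X) + (1 - alpha) * REP(X)
    return max(unlabeled, key=lambda x: alpha * info[x] + (1 - alpha) * rep[x])
```

Setting α near 1 recovers pure uncertainty-style sampling, while α near 0 recovers pure representativeness-driven sampling.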
8. The active learning sampling method based on informativeness and representativeness according to claim 1, wherein the sampling algorithm in step 2) is an index-based sampling algorithm, specifically as follows: two comparative indexes are calculated, a random ordering technique is adopted to balance the search bias between the different indexes, and a parameter β is introduced to control the proportion of the two indexes. The two comparative indexes I1(X) and I2(X) are calculated as follows:
[The formulas for I1(X) and I2(X) appear only as images in the original filing.]
where Y prefixes X indicates that Y precedes X in the ordering.
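Because the definitions of I1 and I2 are given only as images in the original filing, the sketch below uses two generic ranking functions and interprets the random-ordering step as drawing, per selection, which index to rank by with probability β. Both the index functions and this interpretation are assumptions for illustration:

```python
import random

def select_by_indexes(unlabeled, index1, index2, beta, rng=random):
    # with probability beta rank by the first index, otherwise by the second;
    # randomizing the choice balances the search bias between the two indexes
    index = index1 if rng.random() < beta else index2
    return max(unlabeled, key=index)
```

With β = 1 the selection always follows the first index, and with β = 0 always the second; intermediate values mix the two search behaviors across iterations.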
9. The active learning sampling method based on informativeness and representativeness according to claim 1, wherein step 3) further comprises assigning the reverse nearest-neighbor samples of the unlabeled sample obtained in step 2) to the same class as that sample, and adding them to the labeled data set after labeling.
10. The active learning sampling method based on informativeness and representativeness according to claim 1, wherein the stopping criterion in step 4) is to set a stability interval φ: if the maximum difference between the values of the unlabeled samples selected in the iterative sampling rounds within the stability interval φ is less than a threshold, the stopping criterion is satisfied and the iteration stops; the value of an unlabeled sample is computed by the sampling algorithm of step 2).
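The stopping rule of claim 10 can be sketched as a sliding-window check on the value of the sample chosen at each iteration. Function and variable names here are illustrative, not from the filing:

```python
def should_stop(value_history, phi, threshold):
    # value_history: sampling-algorithm value of the sample selected at
    # each iteration; stop once the spread of these values within the
    # last phi iterations falls below the threshold
    if len(value_history) < phi:
        return False
    window = value_history[-phi:]
    return max(window) - min(window) < threshold
```

Intuitively, once the best available sample's score stops changing across φ consecutive rounds, further labeling is unlikely to improve the model and the loop terminates.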
CN202011296097.7A 2020-11-18 2020-11-18 Active learning sampling method based on information degree and representativeness Pending CN112418293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011296097.7A CN112418293A (en) 2020-11-18 2020-11-18 Active learning sampling method based on information degree and representativeness


Publications (1)

Publication Number Publication Date
CN112418293A true CN112418293A (en) 2021-02-26

Family

ID=74774846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011296097.7A Pending CN112418293A (en) 2020-11-18 2020-11-18 Active learning sampling method based on information degree and representativeness

Country Status (1)

Country Link
CN (1) CN112418293A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118429625A (en) * 2024-07-05 2024-08-02 湖南大学 Kitchen waste target detection method based on active learning selection strategy

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
CN106991444A (en) * 2017-03-31 2017-07-28 西南石油大学 The Active Learning Method clustered based on peak density
CN111563590A (en) * 2020-04-30 2020-08-21 华南理工大学 Active learning method based on generation countermeasure model



Similar Documents

Publication Publication Date Title
CN113190699A (en) Remote sensing image retrieval method and device based on category-level semantic hash
Sun et al. Global-local label correlation for partial multi-label learning
CN109376796A (en) Image classification method based on active semi-supervised learning
CN112732921B (en) False user comment detection method and system
CN104966105A (en) Robust machine error retrieving method and system
CN111325264A (en) Multi-label data classification method based on entropy
CN109447110A (en) The method of the multi-tag classification of comprehensive neighbours' label correlative character and sample characteristics
CN112199532A (en) Zero sample image retrieval method and device based on Hash coding and graph attention machine mechanism
CN111026887B (en) Cross-media retrieval method and system
CN106156805A (en) A kind of classifier training method of sample label missing data
CN110196995B (en) Complex network feature extraction method based on biased random walk
CN114943017B (en) Cross-modal retrieval method based on similarity zero sample hash
CN113377981A (en) Large-scale logistics commodity image retrieval method based on multitask deep hash learning
CN109657112A (en) A kind of cross-module state Hash learning method based on anchor point figure
CN113535947B (en) Multi-label classification method and device for incomplete data with missing labels
CN114708903A (en) Method for predicting distance between protein residues based on self-attention mechanism
Ding et al. Survey of spectral clustering based on graph theory
CN109871379A (en) A kind of online Hash K-NN search method based on data block study
CN114093445B (en) Patient screening marking method based on partial multi-marking learning
CN111046965A (en) Method for discovering and classifying potential classes in multi-label classification
CN112418293A (en) Active learning sampling method based on information degree and representativeness
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN114328663A (en) High-dimensional theater data dimension reduction visualization processing method based on data mining
CN117114105B (en) Target object recommendation method and system based on scientific research big data information
CN108154189A (en) Grey relational cluster method based on LDTW distances

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226