
CN110909889B - Training set generation and model training method and device based on feature distribution - Google Patents

Training set generation and model training method and device based on feature distribution

Info

Publication number
CN110909889B
CN110909889B (application CN201911201464.8A)
Authority
CN
China
Prior art keywords: training, distribution, training set, determining, test set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911201464.8A
Other languages
Chinese (zh)
Other versions
CN110909889A (en)
Inventor
王塑
王泽荣
王亚可
刘宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kuangjing Boxuan Technology Co ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Shanghai Kuangjing Boxuan Technology Co ltd
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kuangjing Boxuan Technology Co ltd, Beijing Megvii Technology Co Ltd filed Critical Shanghai Kuangjing Boxuan Technology Co ltd
Priority to CN201911201464.8A
Publication of CN110909889A
Application granted
Publication of CN110909889B
Legal status: Active (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training set generation and model training method and device based on feature distribution, relating to the technical field of machine learning and comprising the following steps: acquiring a test set and a plurality of training sets, extracting samples from the test set and from each training set, and determining features from the extracted samples; determining the feature distributions of the test set and of the plurality of training sets according to the features; and extracting samples from the plurality of training sets according to the feature distributions to form a new training set, so as to generate a new training set whose feature distribution is aligned with that of the test set. The invention determines the correlation among the data sets by analyzing how the features of the test set and of the plurality of training sets are distributed in the feature space, determines a sampling probability for each training set, and automatically samples the training sets to generate a new training set. The invention can align the data distribution of the new training set with the data distribution of the test set, ensuring that the model obtains good performance on the test set.

Description

Training set generation and model training method and device based on feature distribution
Technical Field
The invention relates to the technical field of machine learning, in particular to a training set generation and model training method and device based on feature distribution.
Background
With the continuous development of science and technology, deep learning has been widely applied in the field of computer vision. Models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can mine abstract, highly expressive features from raw data, and have achieved good experimental results in target detection, image classification, image segmentation, video detection, human behavior recognition, speech recognition and other tasks.
While algorithms and models are critical in visual tasks, research often focuses on the models and algorithms themselves. In practice, however, the role of data sets in visual tasks is becoming more and more apparent, and they have become one of the important factors in visual recognition research. Particularly with the advent of the big-data age, researchers have begun to pay more attention to the study of data sets. When a data set contains enough data, even a simple model and algorithm can achieve a good training effect.
The existing neural network model training method generally divides a data set into a training set and a test set: the training set serves as the sample set for training the parameters of the neural network model, and the test set is used to test the trained model and objectively evaluate its performance.
In general, for a model to perform well on a test set, the data distributions of the training set and the test set must be kept consistent; for the model to perform well in actual use, the distribution of the training set must be kept consistent with the distribution of real data. The selection of the training set therefore has a great influence on the learning and training effect of the neural network model. Sometimes, however, when the amount of data is insufficient, a training set can only be generated from existing data sets; otherwise the training set contains too few samples and the training effect of the model suffers.
Disclosure of Invention
The present invention aims to solve the technical problems in the related art at least to a certain extent. To this end, an embodiment of the first aspect of the present invention provides a training set generation method based on feature distribution, which includes:
acquiring a test set and a plurality of training sets, respectively extracting samples from the test set and each training set, and respectively determining the characteristics of the test set and each training set according to the extracted samples; wherein the feature extraction is performed on samples extracted from the test set and the plurality of training sets using a pre-trained recognition model;
determining feature distribution of the test set and a plurality of training sets according to the features;
and extracting samples from a plurality of training sets according to the characteristic distribution to form a new training set, so as to generate the new training set with the characteristic distribution aligned with the characteristic distribution of the test set.
Further, the determining the feature distribution of the test set and the plurality of training sets according to the features specifically includes:
and carrying out probability density estimation according to the features, and determining the feature distribution according to the probability density estimation result.
Further, the determining the feature distribution according to the result of the probability density estimation specifically includes:
and taking the probability density distribution of the test set and each training set obtained by the probability density estimation as the characteristic distribution.
Further, the extracting samples from the plurality of training sets according to the feature distribution to form a new training set specifically includes:
determining the extraction probability of each training set according to the characteristic distribution;
and extracting samples from a plurality of training sets according to the extraction probability to form the new training set.
Further, the determining the extraction probability of each training set according to the feature distribution specifically includes:
constructing a loss function according to the characteristic distribution of each training set and each testing set;
and determining the extraction probability of each training set according to the loss function.
Further, the constructing a loss function according to the feature distribution of each of the training set and the test set specifically includes:
determining the correlation of a plurality of training sets and the test set according to the characteristic distribution;
and constructing the loss function according to the correlation.
Further, the determining the correlation between the plurality of training sets and the test set according to the feature distribution specifically includes:
determining the correlation among a plurality of training sets according to the characteristic distribution;
and determining the relevance of each training set and the test set according to the characteristic distribution.
Further, the determining the extraction probability of each training set according to the loss function specifically includes: and solving the loss function to obtain an optimal solution, and determining the extraction probability of each training set according to the optimal solution.
To achieve the above object, an embodiment of the second aspect of the present invention further provides a training set generating device based on feature distribution, including:
the acquisition module is used for acquiring a test set and a plurality of training sets, extracting samples from the test set and each training set respectively, and determining the characteristics of the test set and each training set respectively according to the extracted samples; wherein the feature extraction is performed on samples extracted from the test set and the plurality of training sets using a pre-trained recognition model;
the processing module is used for determining the characteristic distribution of the test set and a plurality of training sets according to the characteristics;
and the generating module is used for extracting samples from a plurality of training sets according to the characteristic distribution to form a new training set so as to generate the new training set with the characteristic distribution aligned with the characteristic distribution of the test set.
By using the training set generation method or device based on the characteristic distribution, the sampling probability of each training set is determined by analyzing the characteristic distribution of the test set and the plurality of training sets in the characteristic space, and the plurality of training sets are automatically sampled according to the sampling probability to generate a new training set. The invention can align the data distribution of the new training set with the data distribution of the test set, ensure the training effect of the model and obtain good performance on the test set. And a new training set can be generated according to the existing training set, so that the new training set does not need to be additionally acquired, and the acquisition cost of training data is saved.
To achieve the above object, an embodiment of a third aspect of the present invention provides a model training method based on feature distribution, including:
acquiring a test set and a plurality of training sets;
generating a new training set according to the test set and a plurality of training sets by adopting the training set generation method based on the characteristic distribution;
and training the model by using the new training set.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a model training apparatus based on feature distribution, including:
the data set acquisition module is used for acquiring a test set and a plurality of training sets;
the training set generation module is used for generating a new training set according to the test set and a plurality of training sets by adopting the training set generation method based on the characteristic distribution;
and the training module is used for training the model by using the new training set.
By using the model training method or device based on the characteristic distribution, a new training set is generated according to the existing multiple training sets by using the training set generating method based on the characteristic distribution. According to the invention, a new training set can be generated according to the existing training set, and the new training set is not required to be additionally acquired, so that the acquisition cost of training data is saved. The new training set used by the invention can ensure that the model has better training effect and can obtain better performance in the test set.
To achieve the above object, an embodiment of a fifth aspect of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the feature distribution-based training set generating method according to the first aspect of the present invention or implements the feature distribution-based model training method according to the third aspect of the present invention.
To achieve the above object, an embodiment of a sixth aspect of the present invention provides a computing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the feature distribution-based training set generating method according to the first aspect of the present invention or implements the feature distribution-based model training method according to the third aspect of the present invention when the processor executes the program.
The non-transitory computer-readable storage medium and the computing device according to the present invention have similar advantageous effects as the feature distribution-based training set generation method according to the first aspect of the present invention or as the feature distribution-based model training method according to the third aspect of the present invention, and will not be described in detail herein.
Drawings
FIG. 1 is a schematic diagram of a training set generation method based on feature distribution according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an identification model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a new training set formed by extracting samples from a plurality of training sets according to a feature distribution according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of determining the extraction probability of each training set according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training set generating device based on feature distribution according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a model training method based on feature distribution according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a model training apparatus based on feature distribution according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computing device according to an embodiment of the invention.
Detailed Description
Embodiments according to the present invention will be described in detail below with reference to the drawings, and when the description refers to the drawings, the same reference numerals in different drawings denote the same or similar elements unless otherwise indicated. It is noted that the implementations described in the following exemplary examples do not represent all implementations of the invention. They are merely examples of apparatus and methods consistent with aspects of the present disclosure as detailed in the claims and the scope of the invention is not limited thereto. Features of the various embodiments of the invention may be combined with each other without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Existing supervised machine learning methods for neural network models mostly set a training set and a test set for training and testing the model. The data contained in the training set are used to train the model, that is, to determine parameters such as the weights and biases of the model; the test set is used only once, to evaluate the generalization ability of the final model after training is complete.
A common way to divide a data set into a training set and a test set is the hold-out method: the data set is directly divided into two mutually exclusive sets, one used as the training set and the other as the test set, with the samples typically split in a ratio of 7:3 or 8:2. In this way, the consistency of the data distributions of the training set and the test set can be maintained, so that a model trained on the training set samples obtains the best performance on the test set. In practical applications, however, the amount of data in the data set may be insufficient to generate a training set with the hold-out method: if the training set contains too few samples, the model will train poorly. On the other hand, if the distribution of the training set differs greatly from that of the test set, a model trained on the training set will not perform well on the test set; and since the test set usually has a distribution similar to the data encountered in actual use, the model will not perform well in practice either.
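As a point of reference, here is a minimal sketch of the hold-out split described above; the 7:3 ratio, the placeholder data and the variable names are illustrative assumptions, not part of the patent.
```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.rand(1000, 128)         # placeholder samples
labels = np.random.randint(0, 10, 1000)  # placeholder labels

# Hold-out method: one mutually exclusive 7:3 split into train / test.
train_x, test_x, train_y, test_y = train_test_split(
    data, labels, test_size=0.3, random_state=42)
```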
In this regard, the present invention proposes a training set generation method that samples and combines existing data sets to generate a new training set, which is then used for training a model. By analyzing the distributions of the existing training sets in the feature space, the similarity between the training sets and the test set is found. Based on this analysis, a training set sampling scheme is determined so that the distribution of the sampled training set in the feature space is as close as possible to the distribution of the test set, making the trained model perform better on the test set.
Fig. 1 is a schematic diagram of a training set generating method based on feature distribution according to an embodiment of the present invention, including steps S11 to S13.
In step S11, a test set and a plurality of training sets are obtained, samples are extracted from the test set and from each training set, and the features of the test set and of each training set are determined from the extracted samples; a pre-trained recognition model is used to extract features from the samples drawn from the test set and the plurality of training sets. In the embodiment of the invention, when a data set is acquired, if the amount of data is insufficient to divide it into a training set and a test set with the hold-out method, a sufficient amount of sample data can be set aside as the test set, and part of the sample data can then be extracted from existing data sets (for example, previously accumulated training sets) to generate a new training set. After the test set and the plurality of training sets are acquired, the test set and each training set are sampled; for example, random sampling may be used to select a proportion of samples from each data set. The embodiment of the invention does not limit the sampling proportion; it may be chosen according to the recognition model, so that the extracted samples largely reflect the feature distribution of the test set and of each training set.
In some embodiments, the features of the samples extracted from the test set and the plurality of training sets are obtained with a pre-trained recognition model that performs feature extraction on the sample images, so that similar images yield similar features. Fig. 2 is a schematic diagram of a recognition model according to an embodiment of the invention, which takes a face recognition model as an example but is not limited thereto. In the embodiment of the invention, the recognition model extracts features from the input image such that similar images lie close together in the feature space: a face image is mapped to a point in the feature space, points at a short distance correspond to faces of the same person, and points at a long distance correspond to different people. For example, as shown in fig. 2, images of the same face A map to nearby points in the feature space, whereas images of a different face B map to points farther from those of face A. The recognition model can be obtained by pre-training, or an existing recognition model that meets the requirements can be used.
In step S11, the samples extracted from the test set and the plurality of training sets are passed through the above recognition model, and the features of the samples extracted from the test set and from each training set are taken as the features of the test set and of the plurality of training sets, respectively. In the embodiment of the invention, suppose there are a test set X_0 and training sets X_1 and X_2; the samples randomly extracted from X_0, X_1 and X_2 are passed through the recognition model to obtain the corresponding features Y_0, Y_1 and Y_2. Each sample randomly extracted from test set X_0, and the feature corresponding to each sample, is a random variable following a probability distribution, and a subset of samples randomly drawn from X_0 is consistent with X_0 in probability distribution; the same holds for the samples drawn from training sets X_1 and X_2. Thus, the probability distribution of an original data set can be represented by the probability distribution of its subset.
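For concreteness, a minimal sketch of the sampling and feature-extraction step follows. Here `extract_features` is a hypothetical stand-in for the pre-trained recognition model (a fixed random projection to a low-dimensional feature space), and the data sets X0, X1, X2 are placeholders; none of these names come from the patent itself.
```python
import numpy as np

rng = np.random.default_rng(0)
PROJECTION = rng.standard_normal((256, 8))  # fixed stand-in embedding

def extract_features(images: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the pre-trained recognition model;
    # a real system would run, e.g., a face-recognition CNN here.
    return images @ PROJECTION  # map each sample to an 8-d feature point

def sample_features(dataset: np.ndarray, ratio: float = 0.1) -> np.ndarray:
    # Randomly draw a fraction of the data set and return its features.
    k = max(1, int(ratio * len(dataset)))
    idx = rng.choice(len(dataset), size=k, replace=False)
    return extract_features(dataset[idx])

# X0 is the test set; X1 and X2 are existing training sets (placeholders).
X0, X1, X2 = (rng.random((n, 256)) for n in (500, 2000, 3000))
Y0, Y1, Y2 = (sample_features(X) for X in (X0, X1, X2))
```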
In step S12, the feature distributions of the test set and the plurality of training sets are determined from the features. In some embodiments, this specifically includes performing probability density estimation on the features and determining the feature distributions from the result of the probability density estimation.
In the embodiment of the invention, the probability density distributions of the test set and of each training set are estimated with a multi-kernel Gaussian function, i.e. Gaussian kernel density estimation. In step S12, the multi-kernel Gaussian function is applied to the features Y_0, Y_1 and Y_2 to estimate the probability density distributions of test set X_0 and training sets X_1 and X_2 as p_0(y), p_1(y) and p_2(y), and these probability density distributions are taken as the feature distributions of the corresponding data sets. In other embodiments of the invention, other probability density estimation methods may be used to obtain the probability density distribution, without limitation.
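As one possible realization of the multi-kernel Gaussian estimate, ordinary Gaussian kernel density estimation can be applied to the sampled features; a sketch using scipy, continuing the hypothetical Y0, Y1, Y2 from above (the equivalence with the patent's "multi-kernel Gaussian function" is an assumption of this sketch):
```python
from scipy.stats import gaussian_kde

# gaussian_kde expects data of shape (n_dims, n_samples); each object
# below is a callable estimate of the corresponding density p_i(y).
p0 = gaussian_kde(Y0.T)  # feature distribution of test set X0
p1 = gaussian_kde(Y1.T)  # feature distribution of training set X1
p2 = gaussian_kde(Y2.T)  # feature distribution of training set X2
```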
In step S13, samples are extracted from a plurality of the training sets according to the feature distribution to form a new training set, so as to generate a new training set with feature distribution aligned with the feature distribution of the test set. In some embodiments, the feature distribution alignment includes the feature distribution of the new training set being the same as the feature distribution of the test set. In other embodiments, the feature distribution alignment includes feature distributions of the new training set that are very close to feature distributions of the test set.
Fig. 3 is a schematic diagram of a new training set formed by extracting samples from a plurality of training sets according to the feature distribution according to an embodiment of the present invention, including steps S131 to S132.
In step S131, the extraction probability of each training set is determined according to the feature distribution. Fig. 4 is a schematic diagram of determining the extraction probability of each training set according to an embodiment of the present invention, including steps S1311 to S1312.
In step S1311, a loss function is constructed from the feature distribution of each of the training set and the test set, and in an embodiment of the present invention, the method includes determining correlations of a plurality of the training sets and the test set from the feature distribution, and constructing the loss function from the correlations. Wherein said determining the correlation of a plurality of said training sets and said test set from said feature distribution comprises: determining the correlation among a plurality of training sets according to the characteristic distribution; and determining the relevance of each training set and the test set according to the characteristic distribution.
In some embodiments, determining the correlations of the plurality of training sets and the test set based on the feature distribution includes determining the correlations of the training sets with each other, A_ij = ∫ p_i(y) p_j(y) dy, and further determining the correlation of each training set with the test set, C_i = ∫ p_i(y) p_0(y) dy, where p_i (i = 1, ..., n) denotes the probability density distribution of the i-th training set, p_0 denotes the probability density distribution of the test set, and i, j = 1, ..., n index the training sets.
In the embodiment of the invention, the features Y_0 to Y_2 obtained above are used to determine the correlation A_12 between training set X_1 and training set X_2, as well as the correlations C_1 and C_2 between test set X_0 and training sets X_1 and X_2, respectively.
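The correlation integrals can be approximated by Monte Carlo, since ∫ p_i(y) p_j(y) dy = E_{y~p_j}[p_i(y)]: averaging one density estimate over feature samples of the other data set estimates A_ij, and likewise for C_i. A sketch under the same assumptions as above (the small self-evaluation bias of a KDE scored on its own samples is ignored here):
```python
import numpy as np

def correlation(density, samples: np.ndarray) -> float:
    # Monte Carlo estimate of the overlap integral between `density`
    # and the distribution that produced `samples`.
    return float(np.mean(density(samples.T)))

A11 = correlation(p1, Y1)
A22 = correlation(p2, Y2)
A12 = correlation(p1, Y2)  # correlation between training sets X1 and X2
C1 = correlation(p1, Y0)   # correlation of training set X1 with the test set
C2 = correlation(p2, Y0)   # correlation of training set X2 with the test set
C0 = correlation(p0, Y0)   # self-correlation of the test set
```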
In an embodiment of the present invention, the loss function is constructed according to the correlations between the plurality of training sets and the test set. Suppose training sets X_1 and X_2 are sampled with extraction probabilities P_1 and P_2, respectively, to form a mixed training set X_merge, whose feature distribution satisfies p_merge(y) = P_1 p_1(y) + P_2 p_2(y). To make the feature distribution of the mixed training set as close as possible to that of the test set, the loss function L is constructed as the squared norm of the difference between the feature distribution of X_merge and the feature distribution of X_0, namely:
L = ∫ || p_merge(y) - p_0(y) ||^2 dy
Since the feature distribution of the mixed training set is p_merge(y) = P_1 p_1(y) + P_2 p_2(y), the loss function L can be written as:
L = ∫ || P_1 p_1(y) + P_2 p_2(y) - p_0(y) ||^2 dy
Expanding this formula yields:
L = P_1^2 A_11 + 2 P_1 P_2 A_12 + P_2^2 A_22 - 2 P_1 C_1 - 2 P_2 C_2 + C_0
where the correlations between the data sets are as obtained above: A_ij = ∫ p_i(y) p_j(y) dy, C_i = ∫ p_i(y) p_0(y) dy, and C_0 = ∫ p_0(y) p_0(y) dy. Thus, the loss function L can be constructed from the correlations of the data sets.
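In matrix form, writing P = (P_1, P_2), A = (A_ij) and C = (C_i), the expansion above is the quadratic L(P) = P^T A P - 2 C^T P + C_0. A sketch of this loss, reusing the correlation estimates from the previous snippet:
```python
import numpy as np

A = np.array([[A11, A12],
              [A12, A22]])  # training-set / training-set correlations
C = np.array([C1, C2])      # training-set / test-set correlations

def loss(P: np.ndarray) -> float:
    # L(P) = P^T A P - 2 C^T P + C0: the squared distance between the
    # mixed training-set distribution and the test-set distribution.
    return float(P @ A @ P - 2.0 * (C @ P) + C0)
```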
In step S1312, the extraction probability of each training set is determined from the loss function. In the embodiment of the invention, to obtain a better training effect, the feature distribution of the generated new training set should be as close as possible to the feature distribution of the test set, i.e. the distance between the two feature distributions should be as small as possible. A model trained with a new training set generated in this way adapts better to the data of the test set and obtains a better test result.
In the embodiment of the present invention, once parameters such as the correlations between the data sets have been calculated, the loss function L is a function of the extraction probabilities P_i whose minimum can be solved for, so an optimal estimate of the extraction probabilities P_i exists. That is, sampling training sets X_1 and X_2 with the solved extraction probabilities P_1 and P_2 yields a mixed training set X_merge that is the optimal selection: its feature distribution is aligned with the feature distribution of test set X_0, so that the model can obtain the best training effect.
In step S132, samples are extracted from the plurality of training sets according to the extraction probabilities to form the new training set. In the embodiment of the invention, training sets X_1 and X_2 are sampled with the solved extraction probabilities P_1 and P_2, respectively, and the sampled data are then mixed to obtain the mixed training set X_merge as the new training set, whose feature distribution is aligned with the feature distribution of the test set.
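Minimizing L over the probability simplex (P_i >= 0, sum of P_i = 1) is a small quadratic program. The sketch below solves it with scipy's SLSQP solver and then mixes the training sets in proportion to the solved probabilities; the target size, seed and helper names are assumptions for the example, not prescribed by the patent.
```python
import numpy as np
from scipy.optimize import minimize

n = 2
result = minimize(loss, x0=np.full(n, 1.0 / n),
                  bounds=[(0.0, 1.0)] * n,
                  constraints=[{"type": "eq",
                                "fun": lambda P: P.sum() - 1.0}],
                  method="SLSQP")
P_opt = result.x  # optimal extraction probabilities P_1, P_2

def build_merged_set(train_sets, probs, size=2000, seed=1):
    # Draw from each training set in proportion to its extraction
    # probability and mix the draws into the new training set X_merge.
    rng = np.random.default_rng(seed)
    parts = [X[rng.choice(len(X), size=int(round(p * size)), replace=True)]
             for X, p in zip(train_sets, probs)]
    return np.concatenate(parts)

X_merge = build_merged_set([X1, X2], P_opt)
```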
By adopting the training set generation method based on feature distribution, the sampling probability of each training set is determined by analyzing the feature distributions of the test set and the plurality of training sets in the feature space, and the training sets are automatically sampled according to these probabilities to generate a new training set. The invention can align the data distribution of the new training set with the data distribution of the test set, ensuring the training effect of the model and good performance on the test set. Moreover, the method works with existing data sets, so no new data set needs to be collected, saving system resources.
The embodiment of the second aspect of the invention also provides a training set generating device based on the characteristic distribution. Fig. 5 is a schematic structural diagram of a training set generating device 500 based on feature distribution according to an embodiment of the present invention, which includes an obtaining module 501, a processing module 502, and a generating module 503.
The obtaining module 501 is configured to obtain a test set and a plurality of training sets, respectively, extract samples from the test set and each of the training sets, and determine features of the test set and each of the training sets according to the extracted samples, respectively, where feature extraction is performed on the samples extracted from the test set and the plurality of training sets using a pre-trained recognition model.
The processing module 502 is configured to determine a feature distribution of the test set and the plurality of training sets according to the features.
The generating module 503 is configured to extract samples from a plurality of the training sets according to the feature distribution to form a new training set, so as to generate a new training set with the feature distribution aligned with the feature distribution of the test set.
For a more specific implementation manner of each module of the training set generating device 500 based on feature distribution, reference may be made to the description of the training set generating method based on feature distribution of the present invention, and similar advantageous effects are provided, which will not be described herein.
An embodiment of a third aspect of the present invention proposes a model training method based on feature distribution. Fig. 6 is a schematic diagram of a model training method based on feature distribution according to an embodiment of the present invention, including steps S61 to S63.
In step S61, a test set and a plurality of training sets are acquired.
In step S62, a training set generating method based on the feature distribution as described above is used to generate a new training set according to the test set and the plurality of training sets.
In step S63, the model is trained using the new training set.
In the embodiment of the present invention, a validation set may also be set. After the model has been trained with the new training set generated by the method, the validation set is used to verify the training effect of the model; it can be used for hyper-parameter tuning, for example selecting the number of hidden units in a neural network, and for determining parameters of the network structure or controlling model complexity. In an embodiment of the invention, the validation set and the training set are independent and non-overlapping.
By adopting the model training method based on the characteristic distribution, the characteristics of the test set and the training sets are identified to determine the distribution of each data set in the characteristic space and determine the correlation among the data sets. And determining the sampling probability of each training set according to the data correlation, and automatically sampling a plurality of training sets to generate a new training set. The invention can be configured according to the existing data set, and does not need to additionally acquire a new data set, thereby saving system resources. The invention enables the characteristic distribution of the new training set to be as close as possible to the characteristic distribution of the test set, and can ensure that the model has better training effect.
An embodiment of the fourth aspect of the present invention provides a model training apparatus based on feature distribution. Fig. 7 is a schematic structural diagram of a model training apparatus 700 based on feature distribution according to an embodiment of the present invention, which includes a data set acquisition module 701, a training set generation module 702, and a training module 703.
The data set acquisition module 701 is configured to acquire a test set and a plurality of training sets.
The training set generating module 702 is configured to generate a new training set according to the test set and a plurality of training sets by using the training set generating method based on the feature distribution.
The training module 703 is configured to train the model using the new training set.
For a more specific implementation manner of each module of the model training apparatus 700 based on feature distribution, reference may be made to the description of the model training method based on feature distribution of the present invention, and similar advantageous effects will be provided, which will not be described herein.
An embodiment of the fifth aspect of the present invention proposes a non-transitory computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a feature distribution based training set generation method according to an embodiment of the first aspect of the present invention or implements a feature distribution based model training method according to an embodiment of the third aspect of the present invention.
In general, the computer instructions for carrying out the methods of the present invention may be carried in any combination of one or more computer-readable storage media. The non-transitory computer-readable storage medium may include any computer-readable medium except a transitorily propagating signal itself.
The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer program code for carrying out operations of the present invention may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages; in particular, the Python language, which is well suited to neural network computing, and platform frameworks based on TensorFlow or PyTorch may be used. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
An embodiment of the sixth aspect of the present invention provides a computing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the feature distribution-based training set generating method according to the embodiment of the first aspect of the present invention or implements the feature distribution-based model training method according to the embodiment of the third aspect of the present invention when the processor executes the program.
The non-transitory computer readable storage medium and the computing device according to the fifth and sixth aspects of the present invention may be implemented with reference to the details of the embodiment according to the first aspect or the embodiment according to the third aspect of the present invention, and have similar advantageous effects as the feature distribution-based training set generating method according to the embodiment of the first aspect or the feature distribution-based model training method according to the embodiment of the third aspect of the present invention, which will not be described here again.
FIG. 8 illustrates a block diagram of an exemplary computing device suitable for use in implementing embodiments of the present disclosure. The computing device 12 shown in fig. 8 is merely an example and should not be taken as limiting the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 8, computing device 12 may be implemented in the form of a general purpose computing device. Components of computing device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computing device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computing device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) 30 and/or cache memory 32. Computing device 12 may further include other removable/non-removable, volatile/nonvolatile computer-readable storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in the figures and commonly referred to as a "hard disk drive"). Although not shown in fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a compact disk read only memory (Compact Disc Read Only Memory; hereinafter CD-ROM), digital versatile read only optical disk (Digital Video Disc Read Only Memory; hereinafter DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the various embodiments of the disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments described in this disclosure.
Computing device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer system/server 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computing device 12 may also communicate with one or more networks such as a local area network (Local Area Network; hereinafter: LAN), a wide area network (Wide Area Network; hereinafter: WAN) and/or a public network such as the Internet via network adapter 20. As shown, network adapter 20 communicates with other modules of computing device 12 over bus 18. It is noted that although not shown, other hardware and/or software modules may be used in connection with computing device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the methods mentioned in the foregoing embodiments.
The computing device of the present invention may be a server or a limited computing power terminal device.
While embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (8)

1. A training set generation method based on feature distribution is applied to visual recognition and is characterized by comprising the following steps:
acquiring a test set and a plurality of training sets, extracting sample images from the test set and each training set respectively, and determining the characteristics of the test set and each training set respectively according to the extracted sample images; wherein feature extraction is performed on sample images extracted from the test set and the plurality of training sets using a pre-trained recognition model;
determining feature distribution of the test set and a plurality of training sets according to the features;
extracting samples from the plurality of training sets according to the feature distribution to form a new training set to generate a new training set having feature distribution aligned with the feature distribution of the test set, wherein extracting samples from the plurality of training sets according to the feature distribution to form a new training set comprises:
determining the extraction probability of each training set according to the characteristic distribution, wherein the method specifically comprises the following steps: constructing a loss function according to the characteristic distribution of each training set and each testing set, and determining the extraction probability of each training set according to the loss function; wherein said constructing a loss function from the feature distribution of each of said training set and said test set comprises: determining the relevance of a plurality of training sets and the testing set according to the characteristic distribution, and constructing the loss function according to the relevance; wherein said determining the correlation of a plurality of said training sets and said test set from said feature distribution comprises: determining the correlation between a plurality of training sets according to the characteristic distribution, and determining the correlation between each training set and the test set according to the characteristic distribution; wherein said determining the extraction probability for each of said training sets according to said loss function comprises: solving the loss function to obtain an optimal solution, and determining the extraction probability of each training set according to the optimal solution;
and extracting samples from a plurality of training sets according to the extraction probability to form the new training set.
2. The feature distribution-based training set generation method of claim 1, wherein the determining feature distributions of the test set and the plurality of training sets from the features comprises:
and carrying out probability density estimation according to the features, and determining the feature distribution according to the probability density estimation result.
3. The feature distribution-based training set generation method according to claim 2, wherein the determining the feature distribution from the result of the probability density estimation comprises:
and taking the probability density distribution of the test set and each training set obtained by the probability density estimation as the characteristic distribution.
4. A training set generating device based on feature distribution, applied to visual recognition, comprising:
the acquisition module is used for acquiring a test set and a plurality of training sets, extracting sample images from the test set and each training set respectively, and determining the characteristics of the test set and each training set respectively according to the extracted sample images; wherein feature extraction is performed on sample images extracted from the test set and the plurality of training sets using a pre-trained recognition model;
the processing module is used for determining the characteristic distribution of the test set and a plurality of training sets according to the characteristics;
a generating module, configured to extract samples from a plurality of the training sets according to the feature distribution to form a new training set, so as to generate a new training set with feature distribution aligned with the feature distribution of the test set, where extracting samples from a plurality of the training sets according to the feature distribution to form a new training set includes:
determining the extraction probability of each training set according to the characteristic distribution, wherein the method specifically comprises the following steps: constructing a loss function according to the characteristic distribution of each training set and each testing set, and determining the extraction probability of each training set according to the loss function; wherein said constructing a loss function from the feature distribution of each of said training set and said test set comprises: determining the relevance of a plurality of training sets and the testing set according to the characteristic distribution, and constructing the loss function according to the relevance; wherein said determining the correlation of a plurality of said training sets and said test set from said feature distribution comprises: determining the correlation between a plurality of training sets according to the characteristic distribution, and determining the correlation between each training set and the test set according to the characteristic distribution; wherein said determining the extraction probability for each of said training sets according to said loss function comprises: solving the loss function to obtain an optimal solution, and determining the extraction probability of each training set according to the optimal solution;
and extracting samples from a plurality of training sets according to the extraction probability to form the new training set.
5. A model training method based on feature distribution is applied to visual recognition and is characterized by comprising the following steps:
acquiring a test set and a plurality of training sets;
generating a new training set according to the test set and a plurality of training sets by adopting the training set generating method based on the characteristic distribution as claimed in any one of claims 1 to 3;
and training the model by using the new training set.
6. A model training device based on feature distribution, applied to visual recognition, comprising:
the data set acquisition module is used for acquiring a test set and a plurality of training sets;
a training set generating module, configured to generate a new training set according to the test set and a plurality of training sets by using the training set generating method based on the feature distribution as set forth in any one of claims 1 to 3;
and the training module is used for training the model by using the new training set.
7. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the feature distribution-based training set generation method according to any of claims 1-3 or implements the feature distribution-based model training method according to claim 5.
8. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the feature distribution-based training set generation method according to any of claims 1-3 or the feature distribution-based model training method according to claim 5 when executing the program.
CN201911201464.8A 2019-11-29 2019-11-29 Training set generation and model training method and device based on feature distribution Active CN110909889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911201464.8A CN110909889B (en) 2019-11-29 2019-11-29 Training set generation and model training method and device based on feature distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911201464.8A CN110909889B (en) 2019-11-29 2019-11-29 Training set generation and model training method and device based on feature distribution

Publications (2)

Publication Number Publication Date
CN110909889A CN110909889A (en) 2020-03-24
CN110909889B (en) 2023-05-09 (granted)

Family

ID=69820804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911201464.8A Active CN110909889B (en) 2019-11-29 2019-11-29 Training set generation and model training method and device based on feature distribution

Country Status (1)

Country Link
CN (1) CN110909889B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523663B (en) * 2020-04-22 2023-06-23 北京百度网讯科技有限公司 Target neural network model training method and device and electronic equipment
CN111652327A (en) * 2020-07-16 2020-09-11 北京思图场景数据科技服务有限公司 Model iteration method, system and computer equipment
CN112651429B (en) * 2020-12-09 2022-07-12 歌尔股份有限公司 Audio signal time sequence alignment method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254195A (en) * 2011-07-25 2011-11-23 广州市道真生物科技有限公司 Training set generation method
WO2018111116A2 (en) * 2016-12-13 2018-06-21 Idletechs As Method for handling multidimensional data
CN109697461A (en) * 2018-12-11 2019-04-30 中科恒运股份有限公司 Disaggregated model training method and terminal device based on finite data
CN109815223A (en) * 2019-01-21 2019-05-28 北京科技大学 A kind of complementing method and complementing device for industry monitoring shortage of data
CN109886403A (en) * 2019-01-28 2019-06-14 中国石油大学(华东) A kind of industrial data generation method based on neural network model


Also Published As

Publication number Publication date
CN110909889A (en) 2020-03-24

Similar Documents

Publication Publication Date Title
US11735176B2 (en) Speaker diarization using speaker embedding(s) and trained generative model
CN108898186B (en) Method and device for extracting image
US11062090B2 (en) Method and apparatus for mining general text content, server, and storage medium
CN108509915B (en) Method and device for generating face recognition model
CN109034069B (en) Method and apparatus for generating information
CN110909889B (en) Training set generation and model training method and device based on feature distribution
EP4030381A1 (en) Artificial-intelligence-based image processing method and apparatus, and device and storage medium
CN116824278B (en) Image content analysis method, device, equipment and medium
CN112532897A (en) Video clipping method, device, equipment and computer readable storage medium
CN110378346B (en) Method, device and equipment for establishing character recognition model and computer storage medium
CN108877787A (en) Audio recognition method, device, server and storage medium
CN112188306A (en) Label generation method, device, equipment and storage medium
KR20210098397A (en) Response speed test method, apparatus, device and storage medium of vehicle device
CN116340778B (en) Medical large model construction method based on multiple modes and related equipment thereof
CN111914841B (en) CT image processing method and device
CN116415020A (en) Image retrieval method, device, electronic equipment and storage medium
CN114863450B (en) Image processing method, device, electronic equipment and storage medium
CN112802495A (en) Robot voice test method and device, storage medium and terminal equipment
CN110209880A (en) Video content retrieval method, Video content retrieval device and storage medium
CN113822589A (en) Intelligent interviewing method, device, equipment and storage medium
CN114565814B (en) Feature detection method and device and terminal equipment
CN108710697B (en) Method and apparatus for generating information
CN114549846A (en) Method and device for determining image information, electronic equipment and storage medium
CN113971742A (en) Key point detection method, model training method, model live broadcasting method, device, equipment and medium
CN116861855A (en) Multi-mode medical resource determining method, device, computer equipment and storage medium

Legal Events

Code - Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
TA01 - Transfer of patent application right (effective date of registration: 20230323)
    Address after: 316-318, block a, Rongke Information Center, No.2, South Road, Academy of Sciences, Haidian District, Beijing, 100190
    Applicant after: MEGVII (BEIJING) TECHNOLOGY Co.,Ltd.
    Applicant after: Shanghai kuangjing Boxuan Technology Co.,Ltd.
    Address before: 316-318, block a, Rongke Information Center, No.2, South Road, Academy of Sciences, Haidian District, Beijing, 100190
    Applicant before: MEGVII (BEIJING) TECHNOLOGY Co.,Ltd.
GR01 - Patent grant