
CN112232384A - Model training method, image feature extraction method, target detection method and device - Google Patents


Info

Publication number: CN112232384A
Authority: CN (China)
Prior art keywords: target, sample image, image, sample, model
Legal status: Granted
Application number: CN202011035233.7A
Other languages: Chinese (zh)
Other versions: CN112232384B (en)
Inventors: 王远江, 郑凯, 袁野
Current Assignee: Beijing Megvii Technology Co Ltd
Original Assignee: Beijing Megvii Technology Co Ltd
Application filed by Beijing Megvii Technology Co Ltd
Priority to CN202011035233.7A
Publication of CN112232384A
Application granted
Publication of CN112232384B
Current status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the application disclose a model training method, an image feature extraction method, a target detection method, and corresponding devices. An embodiment of the method comprises: acquiring a first sample set, where the first sample set comprises sample images; extracting part of the sample images from the first sample set as target sample images, and executing the following training steps: inputting each target sample image into an initial model to obtain feature information of each target sample image; clustering the obtained feature information, and determining a negative sample image corresponding to each target sample image based on the clustering result; determining a positive sample image corresponding to each target sample image; determining a loss value based on the positive and negative sample images corresponding to each target sample image, and adjusting the parameters of the initial model based on the loss value; and in response to detecting that training of the initial model is finished, determining the initial model with adjusted parameters as the image feature extraction model. This embodiment reduces the labor cost of model training while improving the accuracy of the model.

Description

Model training method, image feature extraction method, target detection method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a model training method, an image feature extraction method, a target detection method and a target detection device.
Background
With the development of artificial intelligence, model training tasks have become increasingly common. For example, when training a model such as an image feature extraction model or a target detection model, a large number of labeled samples is generally required for adequate supervised learning.
In the prior art, samples are labeled manually, and the labeled samples are then used for model training. This manual labeling is labor-intensive. Moreover, subjective differences among annotators can make the labels insufficiently accurate, so that the output of the model is also insufficiently accurate.
Disclosure of Invention
The embodiments of the application provide a model training method, an image feature extraction method, a target detection method, and corresponding devices, aiming to solve the prior-art problems of high labor cost and low accuracy of model output caused by manual labeling of samples during model training.
In a first aspect, an embodiment of the present application provides an image feature extraction model training method, including: acquiring a first sample set, wherein the first sample set comprises sample images; extracting partial sample images from the first sample set as target sample images, and executing the following training steps: inputting each target sample image into the initial model to obtain the characteristic information of each target sample image; clustering the obtained characteristic information, and determining a negative sample image corresponding to each target sample image based on a clustering result; determining a positive sample image corresponding to each target sample image; determining a loss value based on the positive sample image and the negative sample image corresponding to each target sample image, and adjusting the parameters of the initial model based on the loss value; and in response to detecting that the initial model training is completed, determining the initial model after the parameters are adjusted as an image feature extraction model.
In a second aspect, an embodiment of the present application provides an image feature extraction method, including: acquiring a target image; and inputting the target image into an image feature extraction model obtained by training by adopting the method in the first aspect to obtain feature information of the target image.
In a third aspect, an embodiment of the present application provides a target detection method, including: acquiring a target image; inputting the target image into a pre-trained target detection model to obtain a target detection result of the target image, wherein the target detection model comprises an image feature extraction model, and the image feature extraction model is obtained by training through the method in the first aspect.
In a fourth aspect, an embodiment of the present application provides an image feature extraction model training apparatus, where the apparatus includes: an acquisition unit configured to acquire a first sample set including sample images; a first training unit configured to extract a part of sample images from the first sample set as target sample images, performing the following training steps: inputting each target sample image into the initial model to obtain the characteristic information of each target sample image; clustering the obtained characteristic information, and determining a negative sample image corresponding to each target sample image based on a clustering result; determining a positive sample image corresponding to each target sample image; determining a loss value based on the positive sample image and the negative sample image corresponding to each target sample image, and adjusting the parameters of the initial model based on the loss value; and in response to detecting that the initial model training is completed, determining the initial model after the parameters are adjusted as an image feature extraction model.
In a fifth aspect, an embodiment of the present application provides an image feature extraction apparatus, including: an acquisition unit configured to acquire a target image; an input unit configured to input the target image into an image feature extraction model obtained by training with the method in the first aspect, so as to obtain feature information of the target image.
In a sixth aspect, an embodiment of the present application provides an object detection apparatus, including: an acquisition unit configured to acquire a target image; an input unit, configured to input the target image into a pre-trained target detection model, so as to obtain a target detection result of the target image, where the target detection model includes an image feature extraction model, and the image feature extraction model is obtained by training with the method in the first aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to carry out the method as described in the first aspect.
In an eighth aspect, embodiments of the present application provide a computer-readable medium on which a computer program is stored, which when executed by a processor, implements the method as described in the first aspect.
According to the model training method, the image feature extraction method, the target detection method, and the devices provided by the embodiments of the application, part of the sample images are extracted from a first sample set as target sample images, and the target sample images are input into an initial model to obtain their feature information; the obtained feature information is clustered, and a negative sample image corresponding to each target sample image is determined based on the clustering result; a positive sample image corresponding to each target sample image is determined; a loss value is determined based on the positive and negative sample images corresponding to each target sample image, and the parameters of the initial model are adjusted based on the loss value; finally, when training of the initial model is finished, the initial model with adjusted parameters is determined as the image feature extraction model. By clustering the feature information, positive and negative samples are judged automatically, so the training follows a self-supervised learning paradigm. The samples used in this training mode need no manual labeling, which greatly reduces labor cost; and because subjective differences in the sample labeling process are no longer involved, the accuracy of the model output is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of one embodiment of a model training method according to the present application;
FIG. 2 is a schematic diagram of a model training process according to the present application;
FIG. 3 is a flow diagram of one embodiment of an image feature extraction method according to the present application;
FIG. 4 is a flow diagram of one embodiment of a target detection method according to the present application;
FIG. 5 is a schematic diagram of an embodiment of a model training apparatus according to the present application;
FIG. 6 is a schematic structural diagram of one embodiment of an image feature extraction apparatus according to the present application;
FIG. 7 is a schematic block diagram of one embodiment of an object detection device according to the present application;
fig. 8 is a schematic structural diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to FIG. 1, a flow 100 of one embodiment of a model training method according to the present application is shown. The model training method comprises the following steps:
step 101, a first sample set is obtained.
In this embodiment, an executing entity (e.g., an electronic device such as a server) of the model training method may acquire the first sample set in various ways. For example, the executing entity may obtain an existing first sample set from another server (e.g., a database server) used for storing samples through a wired or wireless connection. As another example, a user may collect samples via a terminal device, so that the executing entity receives the samples collected by the terminal and stores them locally, thereby generating the first sample set. It should be noted that the wireless connection means may include, but are not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX (Worldwide Interoperability for Microwave Access) connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.
In this embodiment, a large number of sample images may be included in the first sample set. The sample image here may be an unlabeled image.
It can be understood that many image processing models (such as an object detection model, an object recognition model, or an image classification model) need to extract image features when processing an image, and therefore all include an image feature extraction network. Conventional training approaches typically use labeled image samples to train the model as a whole with supervision, and thus rely on a large number of accurately labeled samples. However, the image feature extraction networks in these image processing models are used only to extract feature information from images and do not need to output final results (such as classification or recognition results); hence, the structures and parameters of the image feature extraction networks in image processing models for different application scenarios are often the same or similar, and the differences lie mainly in the subsequent network structures (such as classification networks) connected to the image feature extraction network. In view of this, in the embodiments of the present application, the image feature extraction network of an image processing model is taken as the training object on its own, and a large number of unlabeled image samples is used to train it, yielding an image feature extraction model capable of accurately extracting image features. On this basis, a small number of labeled image samples can further be used to train the image processing model as a whole, so that it can accurately output the required result. Since unlabeled images are easy to obtain, using them to train the image feature extraction model greatly saves the labor cost of labeling and avoids the problem of inaccurate model output caused by the subjective differences of manual labeling.
In some optional implementations of this embodiment, the first sample set may be obtained by data enhancement of a second sample set. In this case, the first sample set may include the original sample images in the second sample set and the enhanced sample images derived from those original sample images. Specifically, an unlabeled second sample set containing original sample images may be obtained first. The second sample set here can be any existing unlabeled image set. Then, at least one of the following operations is performed on each original sample image to obtain a corresponding enhanced sample image: random cropping, horizontal flipping, chroma adjustment, brightness adjustment, saturation adjustment, and Gaussian noise addition. Finally, the original sample images in the second sample set and the obtained enhanced sample images are combined to form the first sample set. In practice, the data enhancement operations may be performed on a GPU (Graphics Processing Unit) to reduce image processing time.
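To make the data-enhancement step concrete, the following is a minimal sketch of how the six listed operations could be applied, assuming PyTorch/torchvision; the crop size, jitter ranges, and noise level are illustrative assumptions, not values from this application.

```python
# Hedged sketch of the data enhancement described above (assumed library:
# torchvision; parameter values are illustrative, not from this application).
import torch
import torchvision.transforms as T

def make_enhanced_samples(original: torch.Tensor, num_views: int = 2):
    """original: (C, H, W) float tensor in [0, 1]; may live on the GPU so
    that augmentation runs there, as the text suggests."""
    augment = T.Compose([
        T.RandomResizedCrop(224),         # random cropping
        T.RandomHorizontalFlip(p=0.5),    # horizontal flipping
        T.ColorJitter(brightness=0.4,     # brightness adjustment
                      saturation=0.4,     # saturation adjustment
                      hue=0.1),           # chroma adjustment
    ])
    views = []
    for _ in range(num_views):
        v = augment(original)
        v = v + 0.01 * torch.randn_like(v)  # Gaussian noise addition
        views.append(v.clamp(0.0, 1.0))
    return views

# The first sample set is then the originals plus all enhanced views.
```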
Step 102, extracting part of the sample images from the first sample set as target sample images.
In this embodiment, the executing entity may select samples from the first sample set and execute the training steps of step 103 to step 106. The manner of extracting the samples and the number of samples extracted are not limited in this application. For example, at least one sample may be extracted at random, or samples whose images have better definition may be extracted preferentially.
Step 103, inputting each target sample image into the initial model to obtain the feature information of each target sample image.
In this embodiment, the image feature extraction network in the image processing model (such as the target detection model, the target recognition model, and the image classification model) that has not been trained may be used as the initial model, and each target sample image extracted in the previous step may be input to the initial model to obtain the feature information of each target sample image.
As an example, the initial model may be the backbone network (backbone) of a target detection model. After each target sample image is input into the initial model, feature maps of multiple scales may be output, for example feature maps of 5 scales, denoted P3, P4, P5, P6, and P7. Here, P3, P4, and P5 may be obtained by applying 1 × 1 convolutions to three feature maps (which may be denoted C3, C4, and C5) generated by a deep convolutional neural network; P6 may be obtained by applying a convolution with a stride of 2 to P5 (i.e., downsampling), and P7 may be obtained by applying a convolution with a stride of 2 to P6. The number and scales of the feature maps are not limited in this embodiment. The backbone network may be connected to a fully connected layer (FC), so that the obtained multi-scale feature maps are fed into the fully connected layer together to obtain feature information in vector form.
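The following sketch illustrates one possible realization of this structure, assuming a ResNet-50 backbone and 256-channel P-levels; the channel sizes, the global pooling before the fully connected layer, and the 128-dimensional output are assumptions added for illustration.

```python
# Hedged sketch of the initial model: C3-C5 from a backbone, 1x1 convolutions
# for P3-P5, stride-2 convolutions for P6/P7, and an FC layer producing the
# vector-form feature information.
import torch
import torch.nn as nn
import torchvision

class InitialModel(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                  resnet.maxpool, resnet.layer1)
        self.c3, self.c4, self.c5 = resnet.layer2, resnet.layer3, resnet.layer4
        self.to_p3 = nn.Conv2d(512, 256, kernel_size=1)    # P3 = 1x1 conv on C3
        self.to_p4 = nn.Conv2d(1024, 256, kernel_size=1)   # P4 = 1x1 conv on C4
        self.to_p5 = nn.Conv2d(2048, 256, kernel_size=1)   # P5 = 1x1 conv on C5
        self.to_p6 = nn.Conv2d(256, 256, 3, stride=2, padding=1)  # downsample P5
        self.to_p7 = nn.Conv2d(256, 256, 3, stride=2, padding=1)  # downsample P6
        self.fc = nn.Linear(5 * 256, feat_dim)

    def forward(self, x):
        c3 = self.c3(self.stem(x))
        c4 = self.c4(c3)
        c5 = self.c5(c4)
        p3, p4, p5 = self.to_p3(c3), self.to_p4(c4), self.to_p5(c5)
        p6 = self.to_p6(p5)
        p7 = self.to_p7(p6)
        # Global-average-pool each scale, then feed all five maps to the FC
        # layer together to obtain the feature information in vector form.
        pooled = [m.mean(dim=(2, 3)) for m in (p3, p4, p5, p6, p7)]
        return self.fc(torch.cat(pooled, dim=1))
```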
Step 104, clustering the obtained feature information and determining a negative sample image corresponding to each target sample image based on the clustering result; and determining a positive sample image corresponding to each target sample image.
In this embodiment, the executing entity may adopt various clustering algorithms, such as the k-means clustering algorithm, to cluster the obtained feature information and obtain a clustering result. As an example, if the preset batch_size (the number of target sample images selected in one training iteration) is 256 and 10 clusters are set, then after clustering, the 256 target sample images may be divided into 10 clusters. The feature information of target sample images within the same cluster has high similarity, while the feature information of target sample images in different clusters has low similarity.
As an example, the executing entity may cluster the obtained feature information as follows. First, a preset number (e.g., 10) of cluster centers is obtained. A cluster center may take the same form as the feature information, e.g., a 128-dimensional vector. The cluster centers may be selected at random from the obtained feature information the first time the training step is performed, and may be updated based on the clustering result in subsequent iterations.
Second, the distance from each piece of obtained feature information to each cluster center is computed. For example, the inner product of the vectors may be used to compute a quantity characterizing the distance to the cluster center.
Third, for each piece of obtained feature information, the cluster whose center has the smallest distance to that feature information is taken as the cluster to which the feature information belongs.
It should be noted that, after clustering the obtained feature information, for each cluster the executing entity may select feature information from the cluster one piece at a time as target feature information, compute a weighted sum of the cluster center and the target feature information, and finally replace the cluster center of the cluster with the weighted-sum result. See the following formula:
C′ = a × C + (1 − a) × current_feature
where a is a preset weight, C is the original cluster center, current_feature is the currently selected target feature information, and C′ is the updated cluster center. Thus, during training, as the feature information is updated, the clustering result and the cluster centers are also updated in real time.
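A minimal sketch of this clustering step follows, under the stated assumptions that features and centers are L2-normalized vectors, that "distance" is derived from the inner product as mentioned above, and that centers are updated with the formula C′ = a × C + (1 − a) × current_feature:

```python
# Hedged sketch of cluster assignment and the momentum-style center update.
import torch
import torch.nn.functional as F

def assign_clusters(features: torch.Tensor, centers: torch.Tensor):
    """features: (N, D); centers: (K, D). Returns one cluster index per row.
    The highest inner product corresponds to the smallest distance."""
    sims = F.normalize(features, dim=1) @ F.normalize(centers, dim=1).T
    return sims.argmax(dim=1)

def update_centers(features, labels, centers, a: float = 0.9):
    """Select feature information one piece at a time and blend it into the
    center of its cluster: C' = a * C + (1 - a) * current_feature."""
    centers = centers.clone()
    for feat, k in zip(features, labels):
        centers[k] = a * centers[k] + (1 - a) * feat
    return F.normalize(centers, dim=1)
```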
In this embodiment, after obtaining the clustering result of the feature information, the execution subject may determine, based on the clustering result, a positive sample image and a negative sample image corresponding to each target sample image; or determining a negative sample image corresponding to each target sample image based on the clustering result, and determining a positive sample image corresponding to each target sample image in other manners.
As an example, for each target sample image, the execution subject may regard, as a positive sample image, another target sample image belonging to the same cluster as the target sample image, and regard, as a negative sample image, another target sample image belonging to a different cluster from the target sample image.
As another example, for each target sample image, the executing entity may select a part of the sample images (e.g., the target sample image satisfying a preset condition) from other target sample images belonging to the same cluster as the target sample image as a positive sample image, and may select other target sample images belonging to different clusters from the target sample image as negative sample images.
As another example, for each target sample image, the executing entity may select a part of the sample images (e.g., the target sample image satisfying the preset condition) from other target sample images belonging to the same cluster as the target sample image as a positive sample image, and use the other sample images as negative sample images.
In some optional implementations of this embodiment, the sample images in the first sample set may include an original sample image (e.g., an image A of a dog) and enhanced sample images corresponding to the original sample image (e.g., an image A′ obtained by horizontally flipping image A, and an image A″ obtained by chroma-adjusting image A). In this case, for each target sample image, the other samples belonging to the same cluster as the target sample image usually include the original sample image and/or the enhanced sample images of that target sample image. Specifically, there are two cases: if the target sample image is an original sample image, the other samples in its cluster include its enhanced sample images. If the target sample image is an enhanced sample image, the other samples in its cluster include its original sample image and the other enhanced sample images of that original sample image.
At this time, based on the clustering result, the executing entity may determine the positive sample image and the negative sample image corresponding to each target sample image by:
and step one, setting a clustering label for each target sample image based on a clustering result.
The target sample images with the characteristic information belonging to the same cluster have the same cluster label, and the target sample images with the characteristic information not belonging to the same cluster have different cluster labels.
The cluster labels are used to indicate and distinguish different clusters; they differ from the class labels of conventionally labeled sample images. For example, a cluster may include an original sample image A of a poodle, enhanced sample images A′ and A″ of the poodle, an original sample image B of a Samoyed, and enhanced sample images B′ and B″ of the Samoyed. If A, A′, A″, B, B′, and B″ are grouped into one cluster, then A, A′, A″, B, B′, and B″ may all have the same cluster label.
Step two, for each target sample image, selecting the enhanced sample images and/or the original sample image corresponding to that target sample image from the other sample images having the same cluster label, as the positive sample images corresponding to the target sample image; and taking each sample image having a different cluster label from the target sample image as a negative sample image corresponding to the target sample image.
Continuing the above example, if a target sample image is the original sample image A of the poodle in a cluster, its corresponding positive sample images may include the enhanced sample images A′ and A″ of the poodle. Its corresponding negative sample images may include the target sample images in the remaining clusters, such as an original sample image C of a British Shorthair cat, enhanced sample images C′ and C″ of the British Shorthair cat, an original sample image D of an orange cat, enhanced sample images D′ and D″ of the orange cat, and so on.
It should be noted that the original sample image B of the Samoyed and the enhanced sample images B′ and B″ of the Samoyed, which belong to the same cluster as the target sample image, may be disregarded when training the initial model; they are then handled according to the task scenario when subsequently training image processing models (such as a target detection model, a target recognition model, or an image classification model). For example, if the training task is to classify canine breeds, dogs of different breeds (such as the aforementioned poodle and Samoyed) can be used as samples of different classes when training the breed classification model. If the training task is species classification (e.g., distinguishing dogs from cats), dogs of different breeds (such as the poodle and Samoyed mentioned above) can be used as samples of the same class when training the species classification model. As another example, if the training task is dog identification (e.g., distinguishing individual dogs), different dogs (e.g., the poodle and the Samoyed, or one poodle and another poodle) may be used as samples of different classes when training the dog identification model.
In an alternative implementation, the enhanced sample image is associated with its original sample image in some way; for example, the enhanced sample image is given the same name prefix as the original sample image when it is generated, or the association between the two is stored at generation time. In that case, the enhanced sample images corresponding to an original sample image can be looked up directly and taken as the positive sample images corresponding to that target sample image, while the executing entity determines the negative sample images corresponding to each target sample image based on the clustering result according to the steps above.
Determining the positive and negative sample images corresponding to each target sample image according to the clustering result ensures that the distance between a positive sample image and its corresponding target sample image is sufficiently small, and the distance between a negative sample image and its corresponding target sample image is sufficiently large, thereby achieving the purpose of discrimination. Meanwhile, because the positive and negative sample images are determined from the clustering result, the model can autonomously learn the similarity between sample images. This realizes automatic labeling based on clustering, avoids inaccurate labels caused by subjective human factors, and can greatly reduce labeling cost.
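As a concrete illustration of the selection rule, the sketch below assumes a bookkeeping array `source_ids` linking each target sample image to the original image it was derived from (the text mentions name prefixes or stored associations as ways of keeping this link):

```python
# Hedged sketch: positives are same-cluster images derived from the same
# original; negatives are all images with a different cluster label.
import torch

def split_pos_neg(cluster_labels: torch.Tensor, source_ids: torch.Tensor, i: int):
    """Return index tensors (positives, negatives) for target sample image i."""
    same_cluster = cluster_labels == cluster_labels[i]
    same_source = source_ids == source_ids[i]
    pos = torch.nonzero(same_cluster & same_source, as_tuple=True)[0]
    pos = pos[pos != i]                      # exclude the image itself
    neg = torch.nonzero(~same_cluster, as_tuple=True)[0]
    return pos, neg
```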
Step 105, determining a loss value based on the positive and negative sample images corresponding to each target sample image, and adjusting the parameters of the initial model based on the loss value.
In this embodiment, the executing entity may determine a loss value based on the positive and negative sample images corresponding to each target sample image, and adjust the parameters of the initial model based on the loss value. The loss value is the value of a loss function; a loss function is a non-negative real-valued function that can be used to characterize the difference between a detection result and the true result. In general, the smaller the loss value, the better the robustness of the model. The loss function may be set according to actual requirements. Here, the distance between the feature information of each target sample image and the feature information of its corresponding positive sample images, and the distance between the feature information of each target sample image and the feature information of its corresponding negative sample images, may be calculated first. The calculated distances are then input into a preset loss function to obtain a loss value. The executing entity may then update the parameters of the initial model with the loss value. Here, the gradient of the loss value with respect to the model parameters may be found using a back propagation algorithm, and the model parameters may then be updated based on the gradient using a gradient descent algorithm. Thus, each time a batch of target sample images is input, the parameters of the initial model can be updated once based on the corresponding loss value, until training is completed.
In practice, whether training is complete may be determined in a number of ways. As an example, training may be determined to be complete when the loss value converges to a certain value. As yet another example, the training may be determined to be completed if the number of times of training of the initial model is equal to a preset number of times. The present embodiment does not specifically limit the determination condition for completion of training.
It should be noted that if it is determined that training of the initial model is complete, the following step 106 may be performed. If the initial model is not yet fully trained, part of the sample images can be extracted from the first sample set again as target sample images, and the training step can continue with the parameter-adjusted initial model and the new target sample images. The extraction method is not limited in this application. For example, when the first sample set contains a large number of sample images, the executing entity may extract sample images that have not been extracted before.
In some optional implementations of this embodiment, the executing entity may calculate the loss value by:
First, for each target sample image, the sum of the distances between the feature information of the target sample image and the feature information of each of its corresponding positive sample images is computed (referred to as the first distance sum for convenience), and the sum of the distances between the feature information of the target sample image and the feature information of each of its corresponding negative sample images is computed (the second distance sum).
Second, for each target sample image, the ratio of its first distance sum to its second distance sum is taken as the loss value corresponding to that target sample image.
Third, the loss values corresponding to the target sample images are summed to obtain the loss value of the initial model.
Thus, for each target sample image, the loss value decreases as the difference between the feature information of the target sample image and the feature information of the corresponding positive sample image decreases and as the difference between the feature information of the target sample image and the feature information of the corresponding negative sample image increases. The final goal is to make the difference between the feature information of the target sample image and the feature information of each corresponding positive sample image as small as possible, and the difference between the feature information of the target sample image and the feature information of each corresponding negative sample image as large as possible, so as to improve the accuracy of model extraction features.
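The following is a minimal sketch of this loss, assuming Euclidean distance between feature vectors (the text does not fix the distance measure) and a small epsilon added for numerical stability:

```python
# Hedged sketch of the ratio loss: sum over target images of
# (first distance sum) / (second distance sum).
import torch

def ratio_loss(features, pos_lists, neg_lists):
    """features: (N, D); pos_lists/neg_lists: per-image index tensors."""
    total = features.new_zeros(())
    for i in range(features.size(0)):
        d_pos = torch.cdist(features[i:i + 1], features[pos_lists[i]]).sum()
        d_neg = torch.cdist(features[i:i + 1], features[neg_lists[i]]).sum()
        total = total + d_pos / (d_neg + 1e-8)  # epsilon is an added assumption
    return total
```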
Step 106, in response to detecting that training of the initial model is completed, determining the initial model with adjusted parameters as the image feature extraction model.
In this embodiment, in response to detecting that the initial model training is completed, the executing entity may determine the initial model after adjusting the parameters as the image feature extraction model.
As an example, fig. 2 is a schematic diagram of a model training process according to the present application. As shown in fig. 2, a first sample set is obtained, which may include a large number of sample images, specifically original sample images and the enhanced sample images corresponding to them. Then, a group of target sample images is selected from the first sample set and input into an initial model, such as a backbone network, to obtain multi-scale feature maps. The multi-scale feature maps are then input into a fully connected layer to obtain feature information in vector form. The feature information is clustered, and the positive and negative sample images corresponding to each target sample image are determined based on the clustering result. A loss value is then determined based on the positive and negative sample images, and finally the parameters of the initial model are updated based on the loss value. This completes one training pass. The training process can be executed iteratively multiple times to obtain the image feature extraction model.
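Putting the pieces above together, a compact sketch of one training iteration from fig. 2 might look as follows; the batch size, cluster count, and optimizer settings are illustrative assumptions, and random center initialization stands in for the first-iteration selection described earlier:

```python
# Hedged sketch of one iteration of the training process in fig. 2, reusing
# InitialModel, assign_clusters, update_centers, split_pos_neg, ratio_loss.
import torch
import torch.nn.functional as F

model = InitialModel(feat_dim=128).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
centers = F.normalize(torch.randn(10, 128), dim=1).cuda()

def train_step(batch_images: torch.Tensor, source_ids: torch.Tensor) -> float:
    global centers
    source_ids = source_ids.cuda()
    feats = model(batch_images.cuda())                 # feature information
    labels = assign_clusters(feats.detach(), centers)  # clustering result
    pos, neg = zip(*(split_pos_neg(labels, source_ids, i)
                     for i in range(feats.size(0))))
    loss = ratio_loss(feats, list(pos), list(neg))
    optimizer.zero_grad()
    loss.backward()                                    # back propagation
    optimizer.step()                                   # gradient descent update
    centers = update_centers(feats.detach(), labels, centers)
    return loss.item()
```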
In some optional implementations of this embodiment, the executing subject may store the generated image feature extraction model locally, or may send it to a terminal device or a database server for storing data.
In some optional implementations of this embodiment, the executing entity may use the image feature extraction model as the feature extraction network of a target detection model to establish an initial target detection model. Besides the feature extraction network, the initial target detection model may include a classification network, a location detection network, and the like. The executing entity may then obtain a labeled third sample set, where the third sample set includes sample images with category labels. Finally, taking the sample images in the third sample set as input, the initial target detection model is trained with a machine learning method based on the category labels of the input sample images, yielding the trained target detection model. Because the feature extraction network of the target detection model has already been trained, training of the initial target detection model can be completed with only a small number of labeled image samples, greatly reducing the labor cost of manual labeling.
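A hedged sketch of this fine-tuning setup is given below. The detection heads are deliberately simplified placeholders: the application names classification and location detection networks but does not fix their architecture, and a practical detector would operate on the multi-scale feature maps rather than the pooled vector used here.

```python
# Hedged sketch: the trained image feature extraction model becomes the
# feature extraction network of an initial target detection model, which is
# then fine-tuned on the small labeled third sample set.
import torch
import torch.nn as nn

class InitialDetector(nn.Module):
    def __init__(self, pretrained_extractor: nn.Module, num_classes: int):
        super().__init__()
        self.features = pretrained_extractor          # trained without labels
        self.cls_head = nn.Linear(128, num_classes)   # classification network
        self.box_head = nn.Linear(128, 4)             # location detection network

    def forward(self, x):
        f = self.features(x)
        return self.cls_head(f), self.box_head(f)

# Fine-tuning then uses a standard supervised objective (e.g., cross-entropy
# for the class labels plus a box regression loss) on the third sample set.
```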
In the method provided by the above embodiment of the present application, part of the sample images is extracted from the first sample set as target sample images, and each target sample image is input into the initial model to obtain its feature information; the obtained feature information is clustered, and a negative sample image corresponding to each target sample image is determined based on the clustering result; a positive sample image corresponding to each target sample image is determined; a loss value is determined based on the positive and negative sample images corresponding to each target sample image, and the parameters of the initial model are adjusted based on the loss value; finally, when training of the initial model is finished, the initial model with adjusted parameters is determined as the image feature extraction model. By clustering the feature information, positive and negative samples are judged automatically, so the training follows a self-supervised learning paradigm. The samples used in this training mode need no manual labeling, which greatly reduces labor cost; and because subjective differences in the sample labeling process are no longer involved, the accuracy of the model output is improved.
With further reference to fig. 3, a flow 300 of yet another embodiment of an image feature extraction method is shown. The flow 300 of the image feature extraction method includes the following steps:
step 301, a target image is acquired.
In this embodiment, the executing entity of the image feature extraction method may acquire a target image, where the target image may be an image from which image features are to be extracted.
Step 302, inputting a target image into an image feature extraction model obtained by pre-training, and obtaining feature information of the target image.
In this embodiment, the executing entity may input the target image to an image feature extraction model obtained by training in advance, so as to obtain feature information of the target image. The image feature extraction model in the present embodiment may be generated by the method described in the embodiment of fig. 1 above. For a specific generation process, reference may be made to the related description of the embodiment in fig. 1, which is not described herein again.
It should be noted that the image feature extraction method of this embodiment may be used to test the image feature extraction model generated by the above embodiments, and the model can then be continuously optimized according to the test results. The method may also be a practical application of the image feature extraction model generated in the above embodiments. Performing image feature extraction with the model generated by the above embodiments helps improve the accuracy of the extracted image features.
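A minimal usage sketch follows, assuming the InitialModel class sketched earlier and a hypothetical checkpoint path:

```python
# Hedged usage sketch for the image feature extraction method.
import torch

model = InitialModel(feat_dim=128)
model.load_state_dict(torch.load("image_feature_extractor.pt"))  # assumed path
model.eval()

with torch.no_grad():
    target_image = torch.rand(1, 3, 224, 224)  # stand-in for a real image
    feature_info = model(target_image)         # (1, 128) feature vector
```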
With further reference to fig. 4, a flow 400 of yet another embodiment of a target detection method is shown. The process 400 of the target detection method includes the following steps:
step 401, a target image is acquired.
In this embodiment, the executing entity of the target detection method may acquire a target image, where the target image may be an image in which a target is to be detected.
Step 402, inputting a target image into a pre-trained target detection model to obtain a target detection result of the target image, wherein the target detection model comprises an image feature extraction model.
In this embodiment, the executing entity may input the target image into a pre-trained target detection model to obtain the target detection result of the target image. The target detection model in this embodiment may include an image feature extraction model for extracting image features. The image feature extraction model is generated using the method described above in the embodiment of fig. 1. For the specific generation process, reference may be made to the related description of the embodiment of fig. 1, which is not repeated here.
It should be noted that the target detection method of this embodiment may be used to test the image feature extraction model generated by the above embodiments, and the model can then be continuously optimized according to the test results. The method may also be a practical application of the image feature extraction model generated in the above embodiments. Using that model to extract image features during target detection helps improve both the accuracy of the extracted features and the accuracy of the target detection.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an image feature extraction model training apparatus, which corresponds to the embodiment of the method shown in fig. 1, and which can be applied in various electronic devices.
As shown in fig. 5, the model training apparatus 500 of the present embodiment includes: an obtaining unit 501 configured to obtain a first sample set, where the first sample set includes a sample image; a first training unit 502 configured to extract a part of the sample images from the first sample set as target sample images, and perform the following training steps: inputting each target sample image into the initial model to obtain the characteristic information of each target sample image; clustering the obtained characteristic information, and determining a negative sample image corresponding to each target sample image based on a clustering result; determining a positive sample image corresponding to each target sample image; determining a loss value based on the positive sample image and the negative sample image corresponding to each target sample image, and adjusting the parameters of the initial model based on the loss value; and in response to the detection that the training of the initial model is completed, determining the initial model after the parameters are adjusted as an image feature extraction model.
In some optional implementations of this embodiment, the obtaining unit 501 is further configured to: acquire an unlabeled second sample set, where the second sample set includes original sample images; perform at least one of the following operations on each original sample image to obtain a corresponding enhanced sample image: random cropping, horizontal flipping, chroma adjustment, brightness adjustment, saturation adjustment, and Gaussian noise addition; and combine the original sample images in the second sample set with the obtained enhanced sample images to obtain the first sample set.
In some optional implementations of this embodiment, the sample images in the first sample set include original sample images and the enhanced sample images corresponding to them; and the first training unit 502 is further configured to: set a cluster label for each target sample image based on the clustering result, where target sample images whose feature information belongs to the same cluster have the same cluster label, and target sample images whose feature information does not belong to the same cluster have different cluster labels; for each target sample image, select the enhanced sample images and/or the original sample image corresponding to that target sample image from the other sample images having the same cluster label, as the positive sample images corresponding to the target sample image; and take each sample image having a different cluster label from the target sample image as a negative sample image corresponding to the target sample image.
In some optional implementations of this embodiment, the first training unit 502 is further configured to: for each target sample image, detecting the sum of first distances between the feature information of the target sample image and the feature information of each positive sample image corresponding to the target sample image, and detecting the sum of second distances between the feature information of the target sample image and the feature information of each negative sample image corresponding to the target sample image; taking the ratio of the sum of the first distances to the sum of the second distances as a loss value corresponding to the target sample image; and summing the loss values corresponding to the target sample images to obtain the loss value of the initial model.
In some optional implementations of this embodiment, the first training unit 502 is further configured to: acquiring a preset number of clustering centers; detecting the distance from each obtained characteristic information to each clustering center; for each obtained feature information, a cluster corresponding to the cluster center having the smallest distance to the feature information is used as the cluster to which the feature information belongs.
In some optional implementation manners of this embodiment, the training step further includes: and for each cluster, selecting one piece of feature information from the cluster one by one as target feature information, carrying out weighted summation on the cluster center of the cluster and the target feature information to obtain a weighted summation result, and replacing the cluster center of the cluster with the weighted summation result.
In some optional implementations of this embodiment, the apparatus further includes: an execution unit configured to: and in response to the detection that the initial model is not trained completely, extracting part of sample images from the first sample set again to serve as target sample images, and continuing to execute the training step by using the initial model with the adjusted parameters and the new target sample images.
In some optional implementations of this embodiment, the apparatus further includes: a second training unit configured to: establishing an initial target detection model by taking the image feature extraction model as a feature extraction network in a target detection model; acquiring a labeled third sample set, wherein the third sample set comprises sample images with category labels; and taking the sample images in the third sample set as input, and training the initial target detection model by using a machine learning method based on the class labels of the input sample images to obtain a trained target detection model.
The device provided by the above embodiment of the present application extracts part of the sample images from the first sample set as target sample images and inputs each target sample image into the initial model to obtain its feature information; clusters the obtained feature information and determines a negative sample image corresponding to each target sample image based on the clustering result; determines a positive sample image corresponding to each target sample image; determines a loss value based on the positive and negative sample images corresponding to each target sample image, and adjusts the parameters of the initial model based on the loss value; and finally, when training of the initial model is finished, determines the initial model with adjusted parameters as the image feature extraction model. By clustering the feature information, positive and negative samples are judged automatically, so the training follows a self-supervised learning paradigm. The samples used in this training mode need no manual labeling, which greatly reduces labor cost; and because subjective differences in the sample labeling process are no longer involved, the accuracy of the model output is improved.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an image feature extraction apparatus, which corresponds to the method embodiment shown in fig. 3, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the image feature extraction device 600 of the present embodiment includes: an acquisition unit 601 configured to acquire a target image; an input unit 602 configured to input the target image into an image feature extraction model, and obtain feature information of the target image.
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 3. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an object detection apparatus, which corresponds to the method embodiment shown in fig. 4, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the object detection apparatus 700 of the present embodiment includes: an acquisition unit 701 configured to acquire a target image; an input unit 702 configured to input the target image into a pre-trained target detection model to obtain a target detection result of the target image, where the target detection model includes an image feature extraction model, and the image feature extraction model is trained by the method in the embodiment corresponding to fig. 1.
It will be understood that the units described in the apparatus 700 correspond to the various steps of the method described with reference to fig. 4. Thus, the operations, features, and resulting advantages described above with respect to the method are also applicable to the apparatus 700 and the units included therein, and will not be described here again.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the system 800. The CPU 801, ROM 802, and RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage portion 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is mounted on the storage section 808 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, and in some cases the name of a unit does not constitute a limitation of the unit itself.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a first sample set, wherein the first sample set comprises sample images; extract a part of the sample images from the first sample set as target sample images, and execute the following training steps: inputting each target sample image into an initial model to obtain feature information of each target sample image; clustering the obtained feature information, and determining a negative sample image corresponding to each target sample image based on the clustering result; determining a positive sample image corresponding to each target sample image; determining a loss value based on the positive sample image and the negative sample image corresponding to each target sample image, and adjusting parameters of the initial model based on the loss value; and in response to detecting that training of the initial model is completed, determining the initial model with the adjusted parameters as the image feature extraction model.
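By way of illustration only, the training steps just described can be sketched in Python. This is a minimal sketch under assumptions of this illustration, not a definitive implementation of the claimed method: it assumes a PyTorch encoder, k-means clustering from scikit-learn, same-cluster images as positives, different-cluster images as negatives, and Euclidean distance.

    import torch
    from sklearn.cluster import KMeans

    def training_step(encoder, optimizer, target_images, n_clusters=10):
        # One pass of the training steps: extract feature information, cluster
        # it, derive positives/negatives from the clustering result, compute a
        # distance-ratio loss, and adjust the parameters of the initial model.
        feats = encoder(target_images)  # (N, D) feature information
        labels = KMeans(n_clusters=n_clusters).fit_predict(
            feats.detach().cpu().numpy())  # clustering result
        loss = feats.new_zeros(())
        for i in range(len(feats)):
            pos = [j for j in range(len(feats)) if labels[j] == labels[i] and j != i]
            neg = [j for j in range(len(feats)) if labels[j] != labels[i]]
            if pos and neg:  # assumes the batch spans at least two clusters
                d_pos = (feats[i] - feats[pos]).norm(dim=1).sum()  # first distances
                d_neg = (feats[i] - feats[neg]).norm(dim=1).sum()  # second distances
                loss = loss + d_pos / d_neg  # ratio as the per-target loss value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()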
The above description is only a preferred embodiment of the present application and an illustration of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but are not limited to) features having similar functions disclosed in the present application.

Claims (16)

1. An image feature extraction model training method, characterized in that the method comprises:
acquiring a first sample set, wherein the first sample set comprises sample images;
extracting a part of the sample images from the first sample set as target sample images, and executing the following training steps: inputting each target sample image into an initial model to obtain feature information of each target sample image; clustering the obtained feature information, and determining a negative sample image corresponding to each target sample image based on a clustering result; determining a positive sample image corresponding to each target sample image; determining a loss value based on the positive sample image and the negative sample image corresponding to each target sample image, and adjusting parameters of the initial model based on the loss value; and in response to detecting that training of the initial model is completed, determining the initial model with the adjusted parameters as an image feature extraction model.
2. The method according to claim 1, wherein acquiring the first sample set comprises:
acquiring an unlabeled second sample set, wherein the second sample set comprises an original sample image;
performing at least one of the following operations on the original sample image to obtain an enhanced sample image corresponding to the original sample image: random cropping, horizontal flipping, chroma adjustment, brightness adjustment, saturation adjustment, and Gaussian noise addition;
and summarizing the original sample image in the second sample set and the obtained enhanced sample image to obtain a first sample set.
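A minimal sketch of one way to realize these enhancement operations with torchvision; the crop size, jitter strengths, and noise scale are illustrative assumptions, not values recited in the claim.

    import torch
    from torchvision import transforms

    # Enhancement pipeline: random cropping, horizontal flipping, chroma (hue),
    # brightness and saturation adjustment, and added Gaussian noise.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=0.4, saturation=0.4, hue=0.1),
        transforms.ToTensor(),
        transforms.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1)),
    ])

    def build_first_sample_set(original_images):
        # Summarize each original sample image and its enhanced counterpart
        # into a single first sample set.
        to_tensor = transforms.ToTensor()
        return [img for orig in original_images
                for img in (to_tensor(orig), augment(orig))]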
3. The method according to claim 1 or 2, wherein the sample images in the first sample set comprise an original sample image and an enhanced sample image corresponding to the original sample image; and determining the negative sample image corresponding to each target sample image based on the clustering result comprises:
setting clustering labels for all target sample images based on the clustering result, wherein target sample images whose feature information belongs to the same cluster have the same clustering label, and target sample images whose feature information does not belong to the same cluster have different clustering labels;
and taking each sample image having a clustering label different from that of the target sample image as a negative sample image corresponding to the target sample image.
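A minimal sketch of this selection, assuming labels is the array of clustering labels set in the previous step:

    import numpy as np

    def negatives_for(i, labels):
        # Every sample image whose clustering label differs from that of
        # target sample image i serves as one of its negative sample images.
        labels = np.asarray(labels)
        return np.flatnonzero(labels != labels[i])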
4. The method according to claim 2 or 3, wherein determining the positive sample image corresponding to each target sample image comprises:
for each target sample image, selecting, from the remaining sample images having the same clustering label as the target sample image, the enhanced sample image and/or the original sample image corresponding to the target sample image as a positive sample image corresponding to the target sample image;
or, for each target sample image, selecting the enhanced sample image and/or the original sample image corresponding to the target sample image as a positive sample image corresponding to the target sample image.
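A sketch of the first alternative, assuming a hypothetical pair_of mapping from each image index to the indices of its corresponding original/enhanced counterparts; the mapping is an assumption of this illustration, not part of the claim.

    import numpy as np

    def positives_for(i, labels, pair_of):
        # Counterparts of target sample image i (its corresponding enhanced
        # and/or original versions) that also share its clustering label.
        labels = np.asarray(labels)
        same_cluster = set(np.flatnonzero(labels == labels[i])) - {i}
        return [j for j in pair_of[i] if j in same_cluster]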
5. The method according to any one of claims 1 to 4, wherein determining the loss value based on the positive sample image and the negative sample image corresponding to each target sample image comprises:
for each target sample image, computing a sum of first distances between the feature information of the target sample image and the feature information of each positive sample image corresponding to the target sample image, and a sum of second distances between the feature information of the target sample image and the feature information of each negative sample image corresponding to the target sample image, and taking the ratio of the sum of the first distances to the sum of the second distances as the loss value corresponding to the target sample image;
and summing the loss values corresponding to the respective target sample images to obtain the loss value of the initial model.
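A sketch of this loss in PyTorch; Euclidean distance is assumed here, since the claim does not fix the distance metric.

    import torch

    def ratio_loss(f_target, f_pos, f_neg):
        # Sum of the first distances (target to each positive) divided by
        # the sum of the second distances (target to each negative).
        d_pos = (f_target - f_pos).norm(dim=1).sum()
        d_neg = (f_target - f_neg).norm(dim=1).sum()
        return d_pos / d_neg

    def model_loss(feats, pos_sets, neg_sets):
        # Loss value of the initial model: the sum of the per-target losses.
        return sum(ratio_loss(feats[i], feats[pos_sets[i]], feats[neg_sets[i]])
                   for i in range(len(feats)))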
6. The method according to any one of claims 1 to 5, wherein clustering the obtained feature information comprises:
acquiring a preset number of cluster centers;
computing the distance from each piece of obtained feature information to each cluster center;
and, for each piece of obtained feature information, taking the cluster corresponding to the cluster center having the smallest distance to the feature information as the cluster to which the feature information belongs.
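A minimal sketch of this assignment (the standard k-means assignment step), again assuming Euclidean distance:

    import numpy as np

    def assign_clusters(features, centers):
        # Distance from each piece of feature information (features, shape
        # (N, D)) to each cluster center (centers, shape (K, D)).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        # Each feature belongs to the cluster whose center is nearest to it.
        return dists.argmin(axis=1)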
7. The method according to claim 6, wherein, after clustering the obtained feature information, the training steps further comprise:
for each cluster, selecting pieces of feature information from the cluster one by one as target feature information, performing a weighted summation of the cluster center of the cluster and the target feature information to obtain a weighted summation result, and replacing the cluster center of the cluster with the weighted summation result.
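A sketch of this sequential update; the weight 0.9 is an illustrative assumption, since the claim only requires some weighted summation.

    import numpy as np

    def update_center(center, cluster_feats, weight=0.9):
        # Blend each piece of feature information in the cluster into the
        # cluster center one by one, replacing the center with each result.
        center = np.asarray(center, dtype=float)
        for f in cluster_feats:  # one target feature information at a time
            center = weight * center + (1.0 - weight) * np.asarray(f)
        return center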
8. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
in response to detecting that training of the initial model is not completed, re-extracting a part of the sample images from the first sample set as target sample images, and continuing to execute the training steps using the initial model with the adjusted parameters and the new target sample images.
9. The method according to any one of claims 1 to 8, characterized in that the method further comprises:
establishing an initial target detection model by using the image feature extraction model as the feature extraction network of a target detection model;
acquiring a labeled third sample set, wherein the third sample set comprises sample images with category labels;
and taking the sample images in the third sample set as input, and training the initial target detection model by a machine learning method based on the category labels of the input sample images, to obtain a trained target detection model.
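One plausible realization with torchvision, treating the trained encoder as the detector's feature extraction network; Faster R-CNN, the anchor sizes, and the RoI pooler are illustrative choices not recited in the claim.

    from torchvision.models.detection import FasterRCNN
    from torchvision.models.detection.anchor_utils import AnchorGenerator
    from torchvision.ops import MultiScaleRoIAlign

    def build_initial_detector(encoder, num_classes, backbone_channels):
        # Use the image feature extraction model as the feature extraction
        # network of the target detection model.
        encoder.out_channels = backbone_channels  # required by FasterRCNN
        anchors = AnchorGenerator(sizes=((32, 64, 128, 256),),
                                  aspect_ratios=((0.5, 1.0, 2.0),))
        roi_pool = MultiScaleRoIAlign(featmap_names=['0'], output_size=7,
                                      sampling_ratio=2)
        return FasterRCNN(encoder, num_classes=num_classes,
                          rpn_anchor_generator=anchors, box_roi_pool=roi_pool)

The initial detector built this way can then be fine-tuned on the labeled third sample set with an ordinary supervised training loop.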
10. An image feature extraction method, characterized in that the method comprises:
acquiring a target image;
inputting the target image into an image feature extraction model trained by the method according to any one of claims 1 to 9, to obtain feature information of the target image.
11. A target detection method, characterized in that the method comprises:
acquiring a target image;
inputting the target image into a pre-trained target detection model to obtain a target detection result of the target image, wherein the target detection model comprises an image feature extraction model trained by the method according to any one of claims 1 to 9.
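A sketch of the inference path for the two methods above; the eval/no-grad handling is ordinary PyTorch practice rather than a requirement of the claims.

    import torch

    @torch.no_grad()
    def extract_features(encoder, target_image):
        encoder.eval()
        return encoder(target_image.unsqueeze(0))  # feature information

    @torch.no_grad()
    def detect(detector, target_image):
        detector.eval()
        return detector([target_image])[0]  # dict of boxes, labels, scores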
12. An image feature extraction model training apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a first sample set including sample images;
a first training unit configured to extract a part of the sample images from the first sample set as target sample images and execute the following training steps: inputting each target sample image into an initial model to obtain feature information of each target sample image; clustering the obtained feature information, and determining a negative sample image corresponding to each target sample image based on a clustering result; determining a positive sample image corresponding to each target sample image; determining a loss value based on the positive sample image and the negative sample image corresponding to each target sample image, and adjusting parameters of the initial model based on the loss value; and in response to detecting that training of the initial model is completed, determining the initial model with the adjusted parameters as an image feature extraction model.
13. An image feature extraction apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a target image;
an input unit configured to input the target image into an image feature extraction model trained by the method according to any one of claims 1 to 9, to obtain feature information of the target image.
14. A target detection apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a target image;
an input unit configured to input the target image into a pre-trained target detection model to obtain a target detection result of the target image, wherein the target detection model comprises an image feature extraction model trained by the method according to any one of claims 1 to 9.
15. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 11.
16. A computer-readable medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 11.
CN202011035233.7A 2020-09-27 2020-09-27 Model training method, image feature extraction method, target detection method and device Active CN112232384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011035233.7A CN112232384B (en) 2020-09-27 2020-09-27 Model training method, image feature extraction method, target detection method and device

Publications (2)

Publication Number Publication Date
CN112232384A 2021-01-15
CN112232384B 2024-11-05

Family

ID=74119658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011035233.7A Active CN112232384B (en) 2020-09-27 2020-09-27 Model training method, image feature extraction method, target detection method and device

Country Status (1)

Country Link
CN (1) CN112232384B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564102A (en) * 2018-01-04 2018-09-21 百度在线网络技术(北京)有限公司 Image clustering evaluation of result method and apparatus
WO2020038138A1 (en) * 2018-08-24 2020-02-27 阿里巴巴集团控股有限公司 Sample labeling method and device, and damage category identification method and device
US20210124967A1 (en) * 2018-08-24 2021-04-29 Advanced New Technologies Co., Ltd. Method and apparatus for sample labeling, and method and apparatus for identifying damage classification
CN110263697A (en) * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Pedestrian based on unsupervised learning recognition methods, device and medium again
CN111178403A (en) * 2019-12-16 2020-05-19 北京迈格威科技有限公司 Method and device for training attribute recognition model, electronic equipment and storage medium
CN111401474A (en) * 2020-04-13 2020-07-10 Oppo广东移动通信有限公司 Training method, device and equipment of video classification model and storage medium
CN111582185A (en) * 2020-05-11 2020-08-25 北京百度网讯科技有限公司 Method and apparatus for recognizing image
CN111612820A (en) * 2020-05-15 2020-09-01 北京百度网讯科技有限公司 Multi-target tracking method, and training method and device of feature extraction model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI XIUXIN; LING ZHIGANG; ZOU WEN: "Semi-supervised hyperspectral image classification based on convolutional neural network", Journal of Electronic Measurement and Instrumentation, no. 10, 15 October 2018 (2018-10-15) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861963A (en) * 2021-02-04 2021-05-28 北京三快在线科技有限公司 Method, device and storage medium for training entity feature extraction model
CN112926508B (en) * 2021-03-25 2022-07-19 支付宝(杭州)信息技术有限公司 Training method and device of living body detection model
CN112926508A (en) * 2021-03-25 2021-06-08 支付宝(杭州)信息技术有限公司 Training method and device of living body detection model
CN113159209A (en) * 2021-04-29 2021-07-23 深圳市商汤科技有限公司 Target detection method, device, equipment and computer readable storage medium
CN113159209B (en) * 2021-04-29 2024-05-24 深圳市商汤科技有限公司 Object detection method, device, equipment and computer readable storage medium
WO2022247005A1 (en) * 2021-05-27 2022-12-01 平安科技(深圳)有限公司 Method and apparatus for identifying target object in image, electronic device and storage medium
CN113065533A (en) * 2021-06-01 2021-07-02 北京达佳互联信息技术有限公司 Feature extraction model generation method and device, electronic equipment and storage medium
CN113222983A (en) * 2021-06-03 2021-08-06 北京有竹居网络技术有限公司 Image processing method, image processing device, readable medium and electronic equipment
CN113657406A (en) * 2021-07-13 2021-11-16 北京旷视科技有限公司 Model training and feature extraction method and device, electronic equipment and storage medium
CN113657406B (en) * 2021-07-13 2024-04-23 北京旷视科技有限公司 Model training and feature extraction method and device, electronic equipment and storage medium
CN113435545A (en) * 2021-08-14 2021-09-24 北京达佳互联信息技术有限公司 Training method and device of image processing model
CN113762508A (en) * 2021-09-06 2021-12-07 京东鲲鹏(江苏)科技有限公司 Training method, device, equipment and medium for image classification network model
CN114360027A (en) * 2022-01-12 2022-04-15 北京百度网讯科技有限公司 Training method and device for feature extraction network and electronic equipment
CN114756677A (en) * 2022-03-21 2022-07-15 马上消费金融股份有限公司 Sample generation method, training method of text classification model and text classification method
CN114756677B (en) * 2022-03-21 2023-07-25 马上消费金融股份有限公司 Sample generation method, training method of text classification model and text classification method
CN114926447A (en) * 2022-06-01 2022-08-19 北京百度网讯科技有限公司 Method for training model, method and device for detecting target
CN114926447B (en) * 2022-06-01 2023-08-29 北京百度网讯科技有限公司 Method for training a model, method and device for detecting a target
CN116012656A (en) * 2023-01-20 2023-04-25 北京百度网讯科技有限公司 Sample image generation method and image processing model training method and device
CN116012656B (en) * 2023-01-20 2024-02-13 北京百度网讯科技有限公司 Sample image generation method and image processing model training method and device

Also Published As

Publication number Publication date
CN112232384B (en) 2024-11-05

Similar Documents

Publication Publication Date Title
CN112232384B (en) Model training method, image feature extraction method, target detection method and device
CN111860573B (en) Model training method, image category detection method and device and electronic equipment
CN108280477B (en) Method and apparatus for clustering images
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN108073910B (en) Method and device for generating human face features
CN113454733A (en) Multi-instance learner for prognostic tissue pattern recognition
CN110910343A (en) Method and device for detecting pavement cracks and computer equipment
CN110659601B (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN111680753A (en) Data labeling method and device, electronic equipment and storage medium
WO2021233041A1 (en) Data annotation method and device, and fine granularity identification method and device
CN114861836B (en) Model deployment method based on artificial intelligence platform and related equipment
US20210089825A1 (en) Systems and methods for cleaning data
CN115034315B (en) Service processing method and device based on artificial intelligence, computer equipment and medium
CN111340831A (en) Point cloud edge detection method and device
CN109886311B (en) Incremental clustering method and device, electronic equipment and computer readable medium
CN112115996B (en) Image data processing method, device, equipment and storage medium
CN116385373A (en) Pathological image classification method and system combining stable learning and hybrid enhancement
CN115393606A (en) Method and system for image recognition
CN114781582A (en) Method, device, equipment and storage medium for learning diagram characteristics with distribution generalization
CN110059743B (en) Method, apparatus and storage medium for determining a predicted reliability metric
CN116415020A (en) Image retrieval method, device, electronic equipment and storage medium
CN116704264B (en) Animal classification method, classification model training method, storage medium, and electronic device
CN111797772A (en) Automatic invoice image classification method, system and device
CN112329810A (en) Image recognition model training method and device based on saliency detection
CN112308090B (en) Image classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant