CN115115871A

CN115115871A - Training method, device and equipment of image recognition model and storage medium

Info

Publication number: CN115115871A
Application number: CN202210590409.8A
Authority: CN
Inventors: 许海伦
Original assignee: Tencent Technology Chengdu Co Ltd
Current assignee: Tencent Technology Chengdu Co Ltd
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2022-09-27
Anticipated expiration: 2042-05-26
Also published as: CN115115871B

Abstract

The application discloses a training method, a training device, equipment and a storage medium of an image recognition model, and relates to the technical field of artificial intelligence. The method comprises the following steps: determining a first classification loss of a target image sample in a positive candidate region through each classifier of an image recognition model for the target image sample in a target sample subset in an image sample set; determining a second classification loss of the target image sample in the negative candidate region through a target classifier corresponding to the target sample subset; and training the image recognition model based on the first classification loss and the second classification loss which respectively correspond to each image sample in the image sample set. The method and the device for identifying the image recognition model can be applied to scenes such as artificial intelligence, intelligent traffic, auxiliary driving and the like, and can enable the negative candidate region corresponding to the target sample subset not to participate in training of classifiers except the target classifier, so that the identification accuracy of the image recognition model under the scene of multi-data combined training is improved.

Description

Training method, device and equipment of image recognition model and storage medium

Technical Field

The embodiment of the application relates to the technical field of artificial intelligence, in particular to a training method, a training device, training equipment and a storage medium for an image recognition model.

Background

With the development of artificial intelligence technology, research and application of the artificial intelligence technology in the field of image recognition are increasing.

At present, in a scene related to multi-data joint training of an image recognition model, a related technology directly trains the image recognition model through a plurality of image sample sets, however, because there is a recognition target conflict between each image sample set, for example, a negative sample (such as an anchor negative sample) corresponding to one image sample set is a positive sample (such as an anchor positive sample) of another image sample set, a negative training effect is generated on the image recognition model, the recognition performance of the image recognition model is inhibited, and the recognition accuracy of the trained image recognition model is not high enough.

Disclosure of Invention

The embodiment of the application provides a training method, a training device, equipment and a storage medium for an image recognition model, which can improve the recognition accuracy of the image recognition model.

According to an aspect of the embodiments of the present application, there is provided a training method of an image recognition model, the method including:

acquiring an image sample set, wherein the image sample set comprises a plurality of sample subsets, and the identification targets corresponding to the sample subsets are different;

for a target sample subset of the plurality of sample subsets, determining a target classifier set corresponding to the target sample subset from the classifiers of the image recognition model; wherein a target classifier in the set of target classifiers corresponds to an identification target to which the subset of target samples corresponds;

for a target image sample in the target sample subset, determining, by the respective classifiers, a first classification loss corresponding to the target image sample based on the respective positive candidate regions corresponding to the target image sample; wherein, an identification target corresponding to the image sample is predicted to exist in the positive candidate region;

determining a second classification loss corresponding to the target image sample based on each negative candidate region corresponding to the target image sample only through a target classifier in the target classifier set; wherein, no recognition target corresponding to the image sample is predicted in the negative candidate region;

training the image recognition model based on a first classification loss and a second classification loss corresponding to each image sample in the image sample set to obtain the trained image recognition model, wherein each classifier in the trained image recognition model is used for recognizing a recognition target corresponding to the image sample set.

According to an aspect of an embodiment of the present application, there is provided an apparatus for training an image recognition model, the apparatus including:

the system comprises a sample set acquisition module, a recognition target acquisition module and a recognition target acquisition module, wherein the sample set acquisition module is used for acquiring an image sample set, the image sample set comprises a plurality of sample subsets, and the recognition targets corresponding to the sample subsets are different;

a target classifier determining module, configured to determine, for a target sample subset of the plurality of sample subsets, a target classifier set corresponding to the target sample subset from among the classifiers of the image recognition model; wherein a target classifier in the set of target classifiers corresponds to an identification target to which the subset of target samples corresponds;

a classification loss obtaining module, configured to determine, by the classifiers, a first classification loss corresponding to a target image sample in the target sample subset based on each positive candidate region corresponding to the target image sample; wherein, an identification target corresponding to the image sample is predicted to exist in the positive candidate region;

the classification loss obtaining module is further configured to determine, based on each negative candidate region corresponding to the target image sample, a second classification loss corresponding to the target image sample only by a target classifier in the target classifier set; wherein, no recognition target corresponding to the image sample is predicted in the negative candidate region;

and the recognition model training module is used for training the image recognition model based on the first classification loss and the second classification loss which respectively correspond to each image sample in the image sample set to obtain the trained image recognition model, and each classifier in the trained image recognition model is used for recognizing the recognition target corresponding to the image sample set.

According to an aspect of the embodiments of the present application, there is provided a computer device, the computer device includes a processor and a memory, the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the training method of the image recognition model.

According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium, in which a computer program is stored, the computer program being loaded and executed by a processor to implement the above-mentioned training method for an image recognition model.

According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the training method of the image recognition model.

The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects.

The target classifier corresponding to the target sample subset is determined from all classifiers corresponding to the image recognition model, and the classification loss of the negative candidate region under the target classifier is determined through the target classifier based on the negative candidate region (such as anchor negative sample) corresponding to the image sample in the target sample subset, so that the negative candidate region corresponding to the target sample subset does not participate in the training of classifiers except the target classifier, the effect that the negative candidate region corresponding to the target sample subset only takes effect in the target sample subset is achieved, the problem that the training effect is inhibited due to the recognition target conflict existing among all the image sample sets in the related art is solved, and the recognition accuracy of the image recognition model under the scene of multi-data combined training is improved.

In addition, the positive candidate regions (such as anchors) corresponding to the image samples in the target sample subset can participate in training of all classifiers corresponding to the image recognition model, so that negative information corresponding to the sample subset except the target sample subset is enriched, and the recognition accuracy of the image recognition model under a multi-data joint training scene is further improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic illustration of an environment for implementing an embodiment provided by an embodiment of the present application;

fig. 2 is a schematic diagram of a positive and negative candidate area acquisition network according to an embodiment of the present application;

FIG. 3 is a flow chart of a method for training an image recognition model according to an embodiment of the present application;

FIG. 4 is a diagram of a classifier control code provided in one embodiment of the present application;

FIG. 5 is a schematic illustration of the use of an image recognition model provided by one embodiment of the present application under multitasking;

FIG. 6 is a block diagram of an apparatus for training an image recognition model according to an embodiment of the present application;

FIG. 7 is a block diagram of an apparatus for training an image recognition model according to another embodiment of the present application;

fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, the following detailed description of the embodiments of the present application will be made with reference to the accompanying drawings.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

Computer Vision technology (CV) is a science for researching how to make a machine "look", and more specifically, it refers to that a camera and a Computer are used to replace human eyes to perform machine Vision such as identification and measurement on a target, and further image processing is performed, so that the Computer processing becomes an image more suitable for human eyes to observe or is transmitted to an instrument to detect. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can capture information from images or multidimensional data. The computer vision technology generally includes technologies such as image processing, image Recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior Recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning, map construction, and the like, and also includes common biometric technologies such as face Recognition, fingerprint Recognition, and the like.

Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.

The technical scheme provided by the embodiment of the application relates to an artificial intelligence computer vision technology and a machine learning technology, the candidate characteristics corresponding to a positive candidate region and a positive candidate region corresponding to an image sample and the candidate characteristics corresponding to a negative candidate region and a negative candidate region are obtained by the computer vision technology, and then an image recognition model is trained (such as a positive candidate region obtaining network, a coordinate regression network and a category regression network in the image recognition model) by the machine learning technology based on the candidate characteristics corresponding to the positive candidate region and the positive candidate region corresponding to the image sample and the candidate characteristics corresponding to the negative candidate region and the negative candidate region to obtain the trained image recognition model.

According to the method provided by the embodiment of the application, the execution main body of each step can be a computer device, and the computer device refers to an electronic device with data calculation, processing and storage capabilities. The Computer device may be a terminal such as a PC (Personal Computer), a tablet Computer, a smartphone, a wearable device, a smart robot, a vehicle, or the like; or may be a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services.

The image recognition model in the embodiment of the present application may be used for tasks such as image recognition, target detection, target recognition, image segmentation, and the like, which is not limited in the embodiment of the present application. The technical scheme provided by the embodiment of the application is suitable for any scene needing to use the image recognition model, such as a target detection scene, a target recognition scene, an image segmentation scene, an image processing scene, an intelligent traffic scene, an auxiliary driving scene and the like. The technical scheme provided by the embodiment of the application can improve the identification accuracy of the image identification model under the scene of multi-data combined training.

In some examples, in a case that the image recognition model is required to be applicable to different target detection tasks, a sample subset with labeling information of different recognition targets may be obtained for the different target detection tasks, so as to obtain an image sample set of the image recognition model. For example, the above-described object detection task may be a task such as clothing detection, person detection, vehicle detection, object detection, road element detection, or the like. By adopting the technical scheme provided by the embodiment of the application, the image recognition model is subjected to joint training based on the image sample set comprising the image samples with the labeling information of different recognition targets, and the image recognition model for detecting different recognition targets can be obtained.

In other examples, in a case that the image recognition model is required to support detection of the new recognition target, a sample subset with the label information of the new recognition target may be obtained for the new recognition target. By adopting the technical scheme provided by the embodiment of the application, the image recognition model is subjected to combined training based on the original image sample set and the sample subset with the labeling information of the newly added recognition target, so that the image recognition model which can be used for detecting the original recognition target and the newly added recognition target is obtained.

The following describes a method for training an image recognition model according to an embodiment of the present application in detail.

Refer to fig. 1, which illustrates a schematic diagram of an environment for implementing an embodiment of the present application. The embodiment implementation environment may include a model training apparatus 10 and a model using apparatus 20.

The model training device 10 may be an electronic device such as a PC, a computer, a tablet, a server, a smart robot, a vehicle-mounted terminal, or some other electronic device with strong computing power. The model training apparatus 10 is used to train the image recognition model 30.

In the embodiment of the present application, the image recognition model 30 is a neural network model that can be used for tasks such as image recognition, target detection, image segmentation, and the like. For example, in an image recognition scenario, the image recognition model 30 may be used to recognize human faces, merchandise, and the like. In the target detection scenario, the image recognition model 30 may be used for target detection of vehicles, people (real or virtual characters, etc.), animals, plants (e.g., trees), and so on. In an image segmentation scenario, the image recognition model 30 may be used to segment scenes, objects, roads, buildings, etc. in an image. The embodiment of the present application does not limit the task to which the image recognition model 30 can be applied. The image recognition model 30 in the embodiment of the present application may support detection tasks of different targets, and the image recognition model 30 may be obtained by joint training through an image sample set including image samples with labeling information of different recognition targets, or multiple image sample sets with labeling information of different recognition targets, which is not limited in the embodiment of the present application.

Alternatively, the model training apparatus 10 may train the image recognition model 30 in a machine learning manner so that it has better performance.

The trained image recognition model 30 can be deployed in the model using device 20 for use to provide functions of image recognition, target detection, image segmentation, and the like. The model using device 20 may be a terminal device such as a mobile phone, a computer, a smart television, a multimedia playing device, a wearable device, a medical device, a vehicle-mounted terminal, or a server, which is not limited in this embodiment of the present application.

In some embodiments, as shown in FIG. 1, the image recognition model 30 may include a positive and negative candidate region acquisition network 310, a coordinate regression network 320, and a category regression network 330.

The positive and negative candidate region acquisition network 310 is configured to acquire a positive candidate region and a negative candidate region corresponding to an image sample based on the image sample in the image sample set. The image sample set comprises a plurality of sample subsets, and differences exist between identification targets respectively corresponding to the sample subsets. For example, different sample subsets may correspond to different recognition targets with different labeling information. In the embodiment of the present application, the candidate region refers to a region in which the recognition target may exist. The identification target refers to a target corresponding to the real labeling information, namely, a target to be identified from the image sample by the image identification model, and different identification tasks can correspond to different identification targets. For example, taking the candidate region as an anchor (also referred to as a candidate frame, etc.), in an anchor-based scene, the candidate region may refer to a plurality of anchors determined according to a set rule from the image sample, or may refer to a plurality of anchors determined according to a set rule from a feature map corresponding to the image sample. In an anchor-free scene, the candidate region may refer to a plurality of anchors determined based on a key point or a center corresponding to a certain target in the feature map, which is not limited in the embodiment of the present application. The above object is not limited in the embodiments of the present application, for example, the object may refer to a specific person (e.g., student a, student B, etc.), a character, a vehicle, a plant, etc., and the object may also refer to a person, an animal (e.g., cat, dog), a vehicle, a plant, etc.

Optionally, in an anchor-based scenario, the positive-negative candidate Region acquisition network 310 may be constructed based on fast RCNN (fast Region with CNN feature, a target detection neural network), SSD (Single Shot multi box Detector, a target detection neural network), YOLO (young Only Look one, a target detection neural network), and the like; in an anchor-free scene, the positive and negative candidate region acquisition network 310 may be constructed based on CornerNet, Grid RCNN, extreme netet, spare RCNN, and the like.

Illustratively, referring to fig. 2, the positive-negative candidate region acquisition network 310 may include a data input layer 311, a feature extraction layer 312, a candidate feature extraction layer 313, and a positive-negative sample division layer 314.

The input of the data input layer 311 may include an image sample set corresponding to the image recognition model 30, identifiers respectively corresponding to a plurality of sample subsets in the image sample set, and real annotation information respectively corresponding to each image sample in the image sample set. The real annotation information may include real coordinate annotation information and real target annotation information. The real coordinate labeling information is used for indicating the real coordinates (i.e. position, area, etc.) of the recognition target, and the real target labeling information is used for indicating the real category of the recognition target. For example, when the labeling border corresponding to the real target is a square frame, the real coordinate labeling information corresponding to the real target may be represented by a pair of diagonal points or a center point + width x height or four sides and other coordinates of the square frame corresponding to the real coordinate, and the real target labeling information corresponding to the real target may be represented by a name of the real target.

Optionally, the data input layer 311 may be configured to generate a classifier control code corresponding to each of the plurality of sample subsets, where the classifier control code is used to determine an engagement state of each classifier in the image recognition model during the training process. Illustratively, the data input layer 311 determines, based on the identifiers of the n sample subsets, recognition targets corresponding to the n sample subsets, and determines, in combination with a classifier distribution corresponding to the image recognition model 30 (for example, one recognition target corresponds to one two-classifier), classifier control codes corresponding to the n sample subsets. Alternatively, the classifier control code may be one-dimensional data of the classifier number × 1.

The feature extraction layer 312 is configured to perform feature extraction on the image sample to obtain a feature map corresponding to the image sample. Illustratively, the Feature extraction layer 312 may be constructed based on ResNet-50 (a convolutional network with a 50-layer network) and FPN (Feature Pyramid Networks). And performing feature extraction on the image sample through ResNet50 to obtain a multi-scale feature map sequence, wherein the multi-scale feature map sequence comprises a plurality of intermediate feature maps with different scales obtained in the feature extraction process. And fusing a plurality of intermediate characteristic graphs with different scales in the multi-scale characteristic graph sequence through the FPN to obtain the characteristic graph corresponding to the image sample.

The ResNet-50 can be replaced by VGGNet (Visual Geometry Group Net), Resnet-101, etc.

Optionally, the image sample may be preprocessed before feature extraction. For example, the image sample may be scaled to a set scale to obtain an adjusted image sample, and then the feature extraction layer 312 performs feature extraction on the adjusted image sample, where the set scale may be adaptively set and adjusted according to actual use requirements. For example, the dimension is set to 900 × 600 pixels. The image sample is scaled to an adjusted image sample with a long side of 900 pixels and a short side of 600 pixels according to an equal proportion, and feature extraction is performed on the adjusted image sample through the feature extraction layer 312. However, zero padding may be performed when the number of pixels on the long side or the short side is insufficient. When the image recognition model obtained in the way is used, the input image can be firstly zoomed to the corresponding set scale and then input into the image recognition model.

The candidate feature extraction layer 313 may be a neural network, which is configured to generate candidate regions on the feature map and obtain candidate features corresponding to each candidate region. The candidate area is used to indicate an area where a recognition target may exist, such as the anchor, the candidate frame, and the like. The candidate features are used for characterizing the prediction target and the prediction coordinates corresponding to the candidate region. The prediction target refers to a prediction result of the image recognition model on the recognition target, and the prediction coordinate refers to a coordinate prediction result of the image recognition model on the recognition target.

For example, the candidate feature extraction layer 313 may include a candidate region generation sublayer and a candidate feature extraction sublayer, where the candidate region generation sublayer generates a plurality of candidate regions at each point on the feature map according to a set rule, and the candidate feature extraction sublayer is configured to perform feature extraction on each candidate region respectively to obtain candidate features corresponding to each candidate region respectively. For example, the size of the feature map is h × w, each point on the feature map generates 9 candidate regions (the size of each candidate region is different), and the feature map corresponds to 9 × h × w candidate regions. In some examples, the candidate feature extraction sublayer may be a Roi Align layer (i.e., a Region of Interest Align layer), and then the candidate feature is a Roi feature.

The positive and negative sample division layer 314 is configured to divide candidate regions corresponding to the image sample to obtain a positive candidate region set and a negative candidate region set corresponding to the image sample, where the positive candidate region set may include one or more positive candidate regions, and the negative candidate region set may include one or more negative candidate regions. The input of the positive and negative sample classification layer 314 may include true annotation information, candidate regions, and candidate features corresponding to the image samples. Illustratively, the positive and negative sample division layer 314 determines a predicted coordinate corresponding to the candidate region based on the candidate feature corresponding to the candidate region, then traverses a real coordinate corresponding to the real annotation information, determines a most likely real coordinate corresponding to the predicted coordinate, and determines the candidate region as a positive candidate region if a degree of coincidence between the predicted coordinate and the most likely real coordinate corresponding to the predicted coordinate (e.g., IoU (Intersection over unit) value) satisfies a positive sample condition (e.g., is greater than a first threshold); if the matching degree between the predicted coordinate and the most likely corresponding real coordinate of the predicted coordinate satisfies a negative sample condition (if less than a second threshold), the candidate region is determined as a negative candidate region. And the recognition target corresponding to the image sample is predicted to exist in the positive candidate region, and the recognition target corresponding to the image sample is predicted not to exist in the negative candidate region.

The coordinate regression network 320 is used for obtaining a coordinate loss corresponding to the image sample, and the coordinate loss is used for representing a difference between the predicted coordinate and the real coordinate. The input of the coordinate regression network 320 may include candidate features corresponding to each positive candidate region, and the coordinate regression network 320 obtains the coordinate loss corresponding to the image sample based on the candidate features corresponding to each positive candidate region and the real coordinate most likely corresponding to each positive candidate region through a coordinate loss function. Alternatively, a Giou (Generalized Intersection over Union) Loss function, a Smooth L1 function, or the like may be used as the coordinate Loss function.

The category regression network 330 includes a plurality of classifiers, each corresponding to an identification target. For example, a two-classifier is correspondingly set for each recognition target. The classification regression network 330 is configured to obtain a classification loss corresponding to each classifier in the image recognition model, where the classification loss is used to represent a difference between the predicted target and the real target. In this embodiment of the present application, the classification loss may include a first classification loss and a second classification loss, where the first classification loss is used to characterize a difference between a prediction target and a real target respectively corresponding to each positive candidate region in the positive candidate region set; the second classification loss is used for characterizing the difference between the prediction target and the real target corresponding to each negative candidate region in the negative candidate region set. Alternatively, the category regression network 330 may perform the calculation of the classification Loss by using a Sigmoid Focal local function, a BCE (Binary Cross Entropy) local function, or the like.

For example, for each positive candidate region, the category regression network 330 may determine, based on the classifier control code, which classifier the positive candidate region needs to participate in the classification loss obtaining process corresponding to, and determine, based on the classifier control code, which classifier (e.g., the target classifier below) the negative candidate region needs to participate in the classification loss obtaining process corresponding to. For example, for each classifier, a first classification loss corresponding to the image sample is obtained based on the candidate features corresponding to the positive candidate regions corresponding to the image sample. And aiming at each target classifier, acquiring a second classification loss corresponding to the image sample based on the candidate features respectively corresponding to the negative candidate regions corresponding to the image sample. Each sample subset corresponds to a respective target classifier set in the image recognition model 30, and the target classifier corresponding to the target sample subset corresponds to the recognition target corresponding to the target sample subset.

The network architecture of the image recognition model is described above, and the training method of the image recognition model is described below.

Referring to fig. 3, a flowchart of a training method of an image recognition model according to an embodiment of the present application is shown. The execution subject of the steps of the method may be the model training apparatus described above. The method can comprise the following steps (301-305).

Step 301, an image sample set is obtained, where the image sample set includes a plurality of sample subsets, and there is a difference between identification targets respectively corresponding to the sample subsets.

The technical scheme provided by the embodiment of the application is suitable for a scene of multi-data combined training, namely, the image recognition model is subjected to combined training through a plurality of sample subsets in the image sample set to obtain the trained image recognition model, and the trained image recognition model can be used for recognizing the recognition target corresponding to the image sample set. The image recognition model in the embodiment of the present application may be used for tasks such as image recognition, target detection, image segmentation, and the like, which are the same as those described in the embodiment above, and parts that are not described in the embodiment of the present application may refer to the embodiment above, and are not described here again.

Exemplarily, taking target detection as an example, if the image sample set includes three sample subsets, a first sample subset corresponds to clothing detection, a second sample subset corresponds to vehicle detection, and a third sample subset corresponds to pedestrian detection, the image recognition model obtained through joint training of the three sample subsets is applicable to three tasks, namely clothing detection, vehicle detection, and pedestrian detection.

Optionally, each sample subset may correspond to different recognition targets, and there may also be partial overlap between the recognition targets corresponding to each sample subset, which is not limited in this application embodiment. The identification target may be set and adjusted according to actual use requirements, for example, the identification target may refer to a specific person (e.g., student a, student B, and the like), a role, a vehicle, a plant, and the like, and the identification target may also refer to a specific person, a specific animal (e.g., cat and dog), a specific vehicle, a specific plant, and the like.

The multiple sample subsets may be from the same scene or from different scenes. For example, in the case where multiple sample subsets are from the same video, the multiple sample subsets may correspond to different personas, respectively.

Step 302, for a target sample subset in a plurality of sample subsets, determining a target classifier set corresponding to the target sample subset from each classifier of the image recognition model; and the target classifiers in the target classifier set correspond to the identification targets corresponding to the target sample subset.

And each classifier of the image recognition model corresponds to a recognition target corresponding to the image sample set one by one. Alternatively, a classifier may be set for each recognition target corresponding to the image sample set. Under the condition that the recognition targets corresponding to the plurality of sample subsets do not overlap, the number of classifiers corresponding to the image recognition model is equal to the sum of the numbers of the recognition targets corresponding to the plurality of sample subsets. For example, if the sample subset a corresponds to the recognition target 1 and the recognition target 2, and the sample subset B corresponds to the recognition target 3 and the recognition target 4, the image recognition model has 4 classifiers. And under the condition that partial overlap exists between the recognition targets respectively corresponding to the plurality of sample subsets, the number of classifiers in the image recognition model is equal to the number of recognition targets after duplication elimination. For example, if the sample subset a corresponds to the recognition target 1 and the recognition target 2, and the sample subset B corresponds to the recognition target 1 and the recognition target 3, the image recognition model has 3 classifiers in total.

The target subset of samples may refer to any subset of samples in the plurality of subsets of samples, and the target subset of samples may correspond to one or more recognition targets. The target classifier set includes classifiers corresponding to the subset of the target samples, that is, the target classifier is a classifier corresponding to an identification target corresponding to the subset of the target samples. Optionally, the target sample subset corresponds to a non-target classifier set, a non-target classifier in the non-target classifier set corresponds to an identification target corresponding to the image sample set except for the identification target corresponding to the target sample subset, and the non-target classifier set includes classifiers except for the target classifier in each classifier, that is, the non-target classifier refers to classifiers other than the target classifier. For example, a sample subset a and a sample subset B are provided, and for the sample subset a, the classifier corresponding to the sample subset a forms a target classifier set corresponding to the sample subset a, and the classifier corresponding to the sample subset B forms a non-target classifier set corresponding to the sample subset a. Optionally, a target classifier set and a non-target classifier set corresponding to the target sample subset may be determined based on the classifier control code under the positive candidate region corresponding to the target sample subset.

In one example, the acquisition process for the set of target classifiers may be as follows: obtaining a classifier control code corresponding to the target sample subset, wherein the classifier control code is used for determining the participation state of each classifier of the image recognition model in the classification loss obtaining process; determining a target classifier set from each classifier of the image recognition model according to the classifier control code; and the classifier control code is determined based on the recognition target corresponding to the image sample set and the classifier distribution corresponding to the image recognition model.

Optionally, the classifier control code may include a first element and a second element, the first element and the second element being different, and the elements corresponding to the classifier control code correspond to the classifiers in the image recognition model in a one-to-one correspondence. The classifier corresponding to the first element can be determined as a target classifier, and the target classifier is set to be in an participation state; determining the classifier corresponding to the second element as a non-target classifier, and setting the non-target classifier to be in a non-participation state; the non-target classifier is a classifier excluding the target classifier in each classifier.

Exemplarily, referring to fig. 4, let a subset of target samples in the image sample set correspond to recognition target 1, recognition target 2, and recognition target 3, and there are 5 different recognition targets corresponding to the image sample set, and the image recognition model corresponds to 5 classifiers, and the subset of target samples may correspond to classifier 1, classifier 2, and classifier 3 of the 5 classifiers. For example, the first element is 1 and the second element is 0. Since the positive candidate region corresponding to the image sample participates in the training of all classifiers, the classifier control code corresponding to the positive candidate region may be set to [1, 1, 1, 1, 1] for the target sample subset, and since the negative candidate region corresponding to the image sample participates in the training of only the target classifier corresponding to the target sample subset, the classifier control code corresponding to the negative candidate region may be set to [1, 1, 1, 0, 0] for the target sample subset. Wherein, 1 indicates that the corresponding classifier participates in the training, and 0 indicates that the corresponding classifier does not participate in the training of the negative candidate region corresponding to the image sample.

Alternatively, the generation of the classifier control code may also be performed after the positive candidate region and the negative candidate region of the target image sample are determined. Illustratively, a classifier control code is set for each negative candidate region respectively to indicate which classifiers need to participate in the classification loss acquisition process corresponding to each negative candidate region. For each positive candidate region, since it can participate in the classification loss acquisition process corresponding to any classifier, no classifier control code may be additionally set, or a classifier control code whose elements are all the first elements may be set by default.

Therefore, the positive candidate region in a certain sample subset can take effect in other sample subsets through the classifier control code, and the negative candidate region in the certain sample subset only takes effect in the sample subset, so that the problem that the training effect is inhibited due to the fact that recognition target conflicts exist among all sample subsets is solved, and the recognition accuracy of the image recognition model under the multi-data combined training scene is improved.

Step 303, for a target image sample in the target sample subset, determining, by each classifier, a first classification loss corresponding to the target image sample based on each positive candidate region corresponding to the target image sample; and predicting that the identification target corresponding to the image sample exists in the positive candidate area.

The target image sample may refer to any image sample in a subset of target samples. The first classification loss is used for representing the difference between the predicted target and the real target corresponding to each positive candidate region. The positive candidate region and the negative candidate region in the embodiment of the present application are the same as those described in the above embodiment, and the content not described in this embodiment refers to the above embodiment, which is not described herein again.

In an example, the positive candidate region and the negative candidate region may be obtained by the positive/negative sample division layer 314 in the above embodiment, and the specific contents thereof may be as follows:

1. and acquiring a real coordinate set corresponding to the target image sample based on the real annotation information corresponding to the target image sample, wherein the real coordinate set comprises real coordinates of each identification target corresponding to the target image sample.

The real labeling information includes real coordinate labeling information and real target labeling information corresponding to the recognition targets, and optionally, the real coordinates corresponding to the recognition targets may be determined based on real coordinate labeling information corresponding to the recognition targets in the real coordinate information.

2. Determining a prediction coordinate corresponding to a target candidate region in a plurality of candidate regions corresponding to the target image sample based on candidate features corresponding to the target candidate region; the candidate area is used for indicating the possible distribution area of the identification target corresponding to the image sample in the image sample.

The target candidate region may refer to any one of a plurality of candidate regions. For example, based on a feature map at h × w scale corresponding to the target image sample, 9 × h × w candidate regions are obtained, and feature extraction is performed on the 9 × h × w candidate regions, so as to obtain candidate features corresponding to the 9 × h × w candidate regions respectively. And performing logistic regression on candidate features corresponding to the target candidate regions in the 9 x h w candidate regions to obtain the prediction coordinates corresponding to the target candidate regions. And the candidate features are used for characterizing the prediction target and the prediction coordinates corresponding to the candidate region.

3. Calculating the predicted coordinates corresponding to the target candidate region with each real coordinate in the real coordinate set respectively to obtain the predicted coordinates corresponding to the target candidate region and the coincidence degree value between each predicted coordinate and each real coordinate respectively; wherein the degree of coincidence value is used for representing the degree of coincidence between the predicted coordinate and the real coordinate.

For example, the degree of match is IoU. And calculating the prediction coordinates corresponding to the target candidate region and each real coordinate in the real coordinate set corresponding to the target image sample through a calculation formula of IoU values to obtain IoU values of the target candidate region under each real coordinate.

4. And if the maximum matching degree value corresponding to the target candidate region is greater than a first threshold value, determining the target candidate region as a positive candidate region.

The first threshold may be set and adjusted according to actual use requirements, such as 0.5, 0.6, 0.7, and the larger the first threshold, the better the quality of the positive candidate region.

For example, a maximum IoU value is determined from IoU values of the target candidate region at each real coordinate, the real coordinate corresponding to the maximum IoU value is the most likely real coordinate corresponding to the target candidate region, and the predicted target corresponding to the target candidate region is most likely the real target corresponding to the real coordinate. If the maximum IoU value is greater than 0.5, the target candidate region is determined to be a positive candidate region. Alternatively, the real annotation information corresponding to the maximum IoU value may be determined as the real annotation information corresponding to the target candidate region.

5. And if the maximum degree of coincidence corresponding to the target candidate region is smaller than a second threshold value, determining the target candidate region as a negative candidate region.

The second threshold may be set and adjusted according to actual use requirements, such as 0.3, 0.4, and the smaller the first threshold, the better the quality of the negative candidate region. Wherein the first threshold is greater than the second threshold.

For example, if the maximum IoU value is less than 0.4, the target candidate region is determined to be a negative candidate region.

Optionally, for the positive candidate region, the corresponding first classification loss includes two parts, one part corresponds to the target classifier corresponding to the current sample subset, and the other part corresponds to the non-target classifier corresponding to the other sample subset except the current sample subset.

In one example, the specific acquisition procedure of the first classification loss may be as follows:

1. and acquiring the prediction targets of the positive candidate regions under the classifiers respectively based on the candidate features respectively corresponding to the positive candidate regions.

For example, for a first classifier of the multiple classifiers, the prediction targets of the respective positive candidate regions under the first classifier can be obtained by the first classifier based on the candidate features respectively corresponding to the respective positive candidate regions. The first classifier may refer to any one of the respective classifiers.

2. And for a first target classifier in a target classifier set corresponding to the target sample subset, determining the classification loss of each positive candidate region under the first target classifier respectively based on the predicted target of each positive candidate region under the first target classifier respectively and the real target corresponding to each positive candidate region respectively.

The first target classifier may refer to any one of a set of target classifiers.

For example, for a target positive candidate area in the plurality of positive candidate areas, the classification Loss of the target positive candidate area under the first target classifier is determined through a Sigmoid Focal local function based on a predicted target of the target positive candidate area under the first target classifier and a real target corresponding to the target positive candidate area.

3. And determining the sum of the classification losses of the positive candidate regions under the first target classifier as the classification loss corresponding to the first target classifier.

4. For a first non-target classifier in a non-target classifier set corresponding to a target sample subset, determining classification loss of each positive candidate region under the first non-target classifier based on a predicted target of each positive candidate region under the first non-target classifier and second labeling information corresponding to each positive candidate region; the second marking information is used for indicating that the prediction target corresponding to the positive candidate area is not the recognition target corresponding to the sample subset except the target sample subset.

The first non-target classifier may refer to any non-target classifier in the set of non-target classifiers. For example, for a target positive candidate area in each positive candidate area, the classification Loss of the target positive candidate area under the first non-target classifier is determined through a Sigmoid Focal local function based on the predicted target of the target positive candidate area under the first non-target classifier and the second label information corresponding to the target positive candidate area.

The second labeling information may be determined according to the recognition target corresponding to the target sample subset corresponding to the positive candidate region. For example, the sample subset 1 corresponds to two recognition targets, i.e., a and B, and the sample subset 2 corresponds to two recognition targets, i.e., C and D, and for the positive candidate region corresponding to the sample subset 1, the probability is a or B, but not C or D, so that in the non-target classifier calculation scenario, the real targets corresponding to the positive candidate region may be set to 0, so as to characterize that the target corresponding to the positive candidate region is not C or D.

5. And determining the sum of the classification losses of the positive candidate regions under the first non-target classifier as the classification loss corresponding to the first non-target classifier.

6. And determining a first classification loss corresponding to the target image sample based on the classification loss corresponding to each target classifier in the target classifier set and the classification loss corresponding to each non-target classifier in the non-target classifier set.

Optionally, the classification loss corresponding to each target classifier and the sum of the classification losses corresponding to each non-target classifier may be determined as the first classification loss corresponding to the target image sample.

Step 304, determining a second classification loss corresponding to the target image sample based on each negative candidate region corresponding to the target image sample only by the target classifier in the target classifier set; and predicting that no recognition target corresponding to the image sample exists in the negative candidate region.

And the second classification loss is used for representing the difference between the prediction target and the real target respectively corresponding to each negative candidate region. The non-target classifiers in the set of non-target classifiers do not participate in the acquisition process of the second classification loss. Optionally, the non-target classifier may not directly classify the negative candidate region to achieve that the negative candidate region does not participate in the acquisition of the second classification loss, and the non-target classifier may also classify the negative candidate region, but the obtained prediction target does not participate in the acquisition of the second classification loss to achieve that the negative candidate region does not participate in the acquisition process of the second classification loss.

In one example, the second classification loss may be obtained as follows:

1. and respectively setting the label information corresponding to each negative candidate area as first label information, wherein the first label information is used for indicating that the prediction target corresponding to the negative candidate area is not the recognition target corresponding to the target sample subset.

For example, if the sample subset 1 corresponds to two recognition targets, i.e., a recognition target a and a recognition target B, and the sample subset 2 corresponds to two recognition targets, i.e., a recognition target C and a recognition target D, then the negative candidate regions corresponding to the sample subset 1 have a high probability of being not a and B, and therefore when the classification loss corresponding to the negative candidate regions is obtained, the real targets corresponding to the negative candidate regions may be set to 0, so as to represent that the targets corresponding to the negative candidate regions are not a and B. Because it is uncertain whether the predicted target corresponding to the negative candidate region is C or D, the negative candidate region corresponding to the sample subset 1 does not participate in the training of the target classifier corresponding to the sample subset 2.

2. For a first target classifier in the target classifier set, determining a predicted target of each negative candidate region under the first target classifier respectively based on the candidate features corresponding to each negative candidate region respectively; and the candidate features are used for characterizing the prediction target and the prediction coordinates corresponding to the candidate region.

The first target classifier may refer to any one of a set of target classifiers. For example, for a target negative candidate region in the negative candidate regions, the target negative candidate region may be classified by the first target classifier based on candidate features corresponding to the target negative candidate region, so as to obtain a predicted target of the target negative candidate region under the first target classifier.

3. And determining the classification loss of each negative candidate region under the first target classifier based on the prediction target of each negative candidate region under the first target classifier and the first marking information.

For example, for a target negative candidate region in the negative candidate regions, the classification Loss of the target negative candidate region under the first target classifier is determined through a Sigmoid Focal local function based on the predicted target of the target negative candidate region under the first target classifier and the first label information.

4. And determining the sum of the classification losses of the negative candidate regions under the first target classifier as the classification loss corresponding to the first target classifier.

5. And determining second classification losses corresponding to the target image samples based on the classification losses respectively corresponding to the target classifiers in the target classifier set.

Optionally, the sum of the classification losses respectively corresponding to the target classifiers in the target classifier set may be determined as the second classification loss corresponding to the target image sample.

Optionally, the first classification loss and the second classification loss may be obtained simultaneously or sequentially, which is not limited in this embodiment of the application.

Step 305, training the image recognition model based on the first classification loss and the second classification loss respectively corresponding to each image sample in the image sample set to obtain a trained image recognition model, wherein each classifier in the trained image recognition model is used for recognizing a recognition target corresponding to the image sample set.

Optionally, the remaining image samples are processed by the same method as the target image sample, so as to obtain a first classification loss and a second classification loss corresponding to the remaining image samples, respectively.

In one example, the image recognition model may also correspond to a coordinate loss used to characterize the difference between the predicted coordinates and the real coordinates, and the coordinate loss may be obtained as follows:

determining the real coordinates corresponding to the positive candidate region under the maximum matching degree value corresponding to the positive candidate region as the real coordinates corresponding to the positive candidate region; determining the prediction coordinates corresponding to the positive candidate regions respectively based on the candidate features corresponding to the positive candidate regions respectively; and determining the coordinate loss corresponding to the target image sample based on the predicted coordinate and the real coordinate respectively corresponding to each positive candidate region, wherein the coordinate loss is used for representing the difference between the predicted coordinate and the real coordinate.

For example, for the target positive candidate region, the coordinate Loss corresponding to the target positive candidate region is determined based on the predicted coordinate and the real coordinate corresponding to the target positive candidate region through the Giou Loss function, and the sum of the coordinate losses corresponding to the positive candidate regions is determined as the coordinate Loss corresponding to the target image sample.

In one example, the training process for the image recognition model may be as follows: training a coordinate regression network in the image recognition model based on the coordinate loss corresponding to each image sample, training a category regression network in the image recognition model based on the first classification loss and the second classification loss corresponding to each image sample, and training the residual network in the image recognition model based on the coordinate loss, the first classification loss and the second classification loss corresponding to each image sample to obtain the trained image recognition model.

Wherein the class regression network includes classifiers of the image recognition model. The remaining networks in the image recognition model refer to networks other than the category regression network and the coordinate regression network, such as the positive and negative candidate region acquisition network 310 described above. And for a non-target classifier in the class regression network, training the non-target classifier based on the first classification loss corresponding to the image sample. And performing iterative training on the image recognition model by taking the minimized coordinate loss, the first classification loss and the second classification loss as targets to obtain the trained image recognition model.

Optionally, in the using process of the trained image recognition model, only the input image needs to be input into the trained image recognition model, so that the coordinate prediction result output by the coordinate regression network and the target prediction result output by each classifier can be obtained, and the target prediction result with the highest probability value in each target prediction result can be determined as the final target prediction result.

In summary, according to the technical solution provided by the embodiment of the present application, the target classifier corresponding to the target sample subset is determined from the classifiers corresponding to the image recognition model, and determining, by the target classifier, a classification loss of the negative candidate region under the target classifier based on the negative candidate region (e.g., anchor negative sample) corresponding to the image sample in the target sample subset, so that the negative candidate regions corresponding to the target sample subset do not participate in the training of classifiers other than the target classifier, therefore, the negative candidate region corresponding to the target sample subset only takes effect in the target sample subset, the problem that training effect is inhibited due to recognition target conflict among image sample sets in the related technology is solved, and the recognition accuracy of the image recognition model in a multi-data combined training scene is improved.

In an exemplary embodiment, a training method of an image recognition model is described, which may include the following steps, for example, when the image recognition model is used to detect a character in a video.

For the target video, the image recognition model is only needed to detect the role a, the role B, the role C and the role D (i.e. the recognition target) in the initial stage. Then, for the role a, the role B, the role C and the role D, the image sample set 1 with the real annotation information is sorted out based on the target video. With the deepening of business, new identification targets are needed: and a role E, a role F and a role G, and sorting out an image sample set 2 with real annotation information based on the target video aiming at the role E, the role F and the role G.

The image sample set 1 does not have real labeling information for the role E, the role F and the role G, and the image sample set 2 does not have real labeling information for the role A, the role B, the role C and the role D.

In a new training process, the image sample set 1 and the image sample set 2 may serve as a sample subset 1 and a sample subset 2 in the image sample set corresponding to the image recognition model, so that the image recognition model may be jointly trained based on the image sample set 1 and the image sample set 2. The specific content can be as follows:

the image recognition model is reset to have 7 two classifiers, which correspond to role a, role B, role C, role D, role E, role F, and role G in this order.

For the image sample set 1, the classifier control code corresponding to the positive candidate region may be set to [1, 1, 1, 1, 1, 1, 1, 1], and the classifier control code corresponding to the negative candidate region may be set to [1, 1, 1, 1, 0, 0, 0 ]. For the image sample set 2, the classifier control code corresponding to the positive candidate region may be set to [1, 1, 1, 1, 1, 1, 1, 1], and the classifier control code corresponding to the negative candidate region may be set to [0, 0, 0, 0, 1, 1, 1 ].

For the positive candidate region in the image sample set 1, the corresponding recognition target (i.e. role a or role B or role C or role D) can be explicitly known, and the positive candidate region in the image sample set 1 can participate in the training of all the two classifiers. For the negative candidate region in the image sample set 1, it is only known that it is impossible to be role a, role B, role C, and role D, and it is uncertain whether it is role E or role F or role G. Therefore, the negative candidate regions in the image sample set 1 only participate in the training of the classifiers corresponding to the image sample set 1. This idea also applies to the set of image samples 2.

The following describes a training process of the image recognition model by taking the image sample set 1 as an example.

For a target image sample in the image sample set 1, a positive candidate region set and a negative candidate region set corresponding to the target image sample are obtained through an image recognition model. The positive candidate region set includes a plurality of positive candidate regions, and the negative candidate region set includes a plurality of negative candidate regions.

And for the two classifiers 1-4 respectively corresponding to the role A, the role B, the role C and the role D, acquiring positive classification losses under the two classifiers 1-4 based on candidate characteristics and real labeling information respectively corresponding to each positive candidate region in the positive candidate region set. And for the two classifiers 5-7 corresponding to the role E, the role F and the role G respectively, acquiring the positive classification loss under the two classifiers 5-7 based on the candidate characteristics and the second labeling information corresponding to each positive candidate region in the positive candidate region set respectively. And determining the first classification loss corresponding to the target image sample based on the positive classification loss under the two classifiers 1-4 and the positive classification loss under the two classifiers 5-7.

According to the classifier control codes corresponding to the negative candidate areas, the two classifiers 1-4 corresponding to the role A, the role B, the role C and the role D are set to be in the participation state, and the two classifiers 5-7 corresponding to the role E, the role F and the role G are set to be in the non-participation state. And acquiring negative classification losses under two classifiers 1-4 based on the candidate characteristics and the first labeling information respectively corresponding to each negative candidate region in the negative candidate region set, and determining a second classification loss corresponding to the target image sample based on the negative classification losses under the two classifiers 1-4.

And acquiring the coordinate loss corresponding to the target image sample based on the candidate features and the real coordinates of each positive candidate region in the positive candidate region set.

Training a coordinate regression network in the image recognition model based on coordinate loss, training the two classifiers 1-4 based on positive classification loss and negative classification loss under the two classifiers 1-4, and training the two classifiers 5-7 based on positive classification loss under the two classifiers 5-7. And performing joint iterative training on the image recognition model based on the image samples in the image sample set 1 and the image sample set 2 to obtain the trained image recognition model. The trained image recognition model can be used for detecting a role A, a role B, a role C, a role D, a role E, a role F and a role G.

In another example, referring to fig. 5, the image recognition model is applied to a traffic scene, and after the image recognition model is jointly trained through an image sample set corresponding to a detection task supporting a train and a track and an image sample set corresponding to a detection task supporting a sky, a signboard and a signboard column, the obtained image recognition model can detect the train, the track, the sky, the signboard and the signboard. For example, the image recognition model may recognize trains, tracks, sky, signs, and sign posts from the input pattern 501.

The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.

Referring to fig. 6, a block diagram of an apparatus for training an image recognition model according to an embodiment of the present application is shown. The device can be used for realizing the training method of the image recognition model. The apparatus 600 may include: a sample set acquisition module 601, a target classifier determination module 602, a classification loss acquisition module 603, and a recognition model training module 604.

The sample set acquiring module 601 is configured to acquire an image sample set, where the image sample set includes a plurality of sample subsets, and there is a difference between identification targets respectively corresponding to the sample subsets.

A target classifier determining module 602, configured to determine, for a target sample subset of the plurality of sample subsets, a target classifier set corresponding to the target sample subset from among the classifiers of the image recognition model; wherein a target classifier in the set of target classifiers corresponds to an identified target to which the subset of target samples corresponds.

A classification loss obtaining module 603, configured to determine, for a target image sample in the target sample subset, a first classification loss corresponding to the target image sample based on each positive candidate region corresponding to the target image sample through each classifier; wherein, the identification target corresponding to the image sample is predicted to exist in the positive candidate area.

The classification loss obtaining module 603 is further configured to determine, based on each negative candidate region corresponding to the target image sample, a second classification loss corresponding to the target image sample only by using a target classifier in the target classifier set; and predicting that no recognition target corresponding to the image sample exists in the negative candidate region.

The recognition model training module 604 is configured to train the image recognition model based on a first classification loss and a second classification loss respectively corresponding to each image sample in the image sample set, so as to obtain the trained image recognition model, where each classifier in the trained image recognition model is used to recognize a recognition target corresponding to the image sample set.

In an exemplary embodiment, the classification loss obtaining module 603 is configured to:

respectively setting the label information corresponding to each negative candidate region as first label information, where the first label information is used to indicate that the prediction target corresponding to the negative candidate region is not the recognition target corresponding to the target sample subset;

for a first target classifier in the target classifier set, determining a predicted target of each negative candidate region under the first target classifier respectively based on the candidate features corresponding to each negative candidate region respectively; the candidate features are used for characterizing the prediction target and the prediction coordinates corresponding to the candidate region;

determining classification loss of each negative candidate region under the first target classifier respectively based on the predicted target of each negative candidate region under the first target classifier respectively and the first labeling information;

determining the sum of the classification losses of the negative candidate regions under the first target classifier as the classification loss corresponding to the first target classifier;

and determining second classification losses corresponding to the target image samples based on the classification losses respectively corresponding to the target classifiers in the target classifier set.

In an exemplary embodiment, the target classifier determination module 602 is configured to:

obtaining a classifier control code corresponding to the target sample subset, wherein the classifier control code is used for determining the participation state of each classifier of the image recognition model in the training process;

determining the target classifier set and the non-target classifier set from each classifier of the image recognition model according to the classifier control code;

wherein the classifier control code is determined based on the recognition target corresponding to the image sample set and the classifier distribution corresponding to the image recognition model.

In an exemplary embodiment, the classifier control code includes a first element and a second element, the first element and the second element are different, and the corresponding elements of the classifier control code correspond to classifiers in the image recognition model in a one-to-one correspondence manner; the target classifier determination module 602 is further configured to:

obtaining a classifier control code corresponding to the target sample subset, wherein the classifier control code is used for determining the participation state of each classifier of the image recognition model in the classification loss obtaining process;

determining the target classifier set from each classifier of the image recognition model according to the classifier control code; wherein the classifier control code is determined based on the recognition target corresponding to the image sample set and the classifier distribution corresponding to the image recognition model.

In an exemplary embodiment, the classification loss obtaining module 603 is further configured to:

acquiring the prediction targets of the positive candidate regions under the classifiers based on the candidate features respectively corresponding to the positive candidate regions;

for a first target classifier in a target classifier set corresponding to the target sample subset, determining a classification loss of each positive candidate region under the first target classifier respectively based on a predicted target of each positive candidate region under the first target classifier respectively and a real target corresponding to each positive candidate region respectively;

determining the sum of the classification losses of the positive candidate regions under the first target classifier as the classification loss corresponding to the first target classifier;

for a first non-target classifier in a non-target classifier set corresponding to the target sample subset, determining classification loss of each positive candidate region under the first non-target classifier respectively based on a predicted target of each positive candidate region under the first non-target classifier respectively and second labeling information corresponding to each positive candidate region respectively; wherein the second label information is used for indicating that the prediction target corresponding to the positive candidate region is not an identification target corresponding to a sample subset except the target sample subset;

determining the sum of the classification losses of the positive candidate regions under the first non-target classifier as the classification loss corresponding to the first non-target classifier;

and determining a first classification loss corresponding to the target image sample based on the classification loss corresponding to each target classifier in the target classifier set and the classification loss corresponding to each non-target classifier in the non-target classifier set.

In an exemplary embodiment, as shown in fig. 7, the apparatus 600 further includes: a real coordinate obtaining module 605, a predicted coordinate obtaining module 606, a coincidence value obtaining module 607, a positive sample obtaining module 608, and a negative sample obtaining module 609.

A real coordinate obtaining module 605, configured to obtain, based on the real annotation information corresponding to the target image sample, a real coordinate set corresponding to the target image sample, where the real coordinate set includes real coordinates of each identification target corresponding to the target image sample.

A predicted coordinate obtaining module 606, configured to determine, for a target candidate region in the multiple candidate regions corresponding to the target image sample, a predicted coordinate corresponding to the target candidate region based on a candidate feature corresponding to the target candidate region; wherein the candidate region is used for indicating a possible distribution region of the identification target corresponding to the image sample in the image sample.

An coincidence value obtaining module 607, configured to calculate the predicted coordinates corresponding to the target candidate region with each real coordinate in the real coordinate set, respectively, to obtain the predicted coordinates corresponding to the target candidate region, and coincidence degree values between the predicted coordinates and each real coordinate, respectively; wherein the degree of agreement value is used to characterize the degree of agreement between the predicted coordinate and the real coordinate.

A positive sample obtaining module 608, configured to determine the target candidate region as the positive candidate region if the maximum degree of matching value corresponding to the target candidate region is greater than a first threshold.

A negative sample obtaining module 609, configured to determine the target candidate region as the negative candidate region if the maximum degree of fit value corresponding to the target candidate region is smaller than a second threshold; wherein the first threshold is greater than the second threshold.

In an exemplary embodiment, as shown in fig. 7, the apparatus 600 further includes: a real coordinate determination module 610 and a coordinate loss acquisition module 611.

A real coordinate determining module 610, configured to determine a real coordinate of the positive candidate region under the maximum matching degree value corresponding to the positive candidate region as the real coordinate corresponding to the positive candidate region.

The predicted coordinate obtaining module 606 is further configured to determine the predicted coordinates corresponding to each positive candidate region based on the candidate features corresponding to each positive candidate region.

A coordinate loss obtaining module 611, configured to determine, based on the predicted coordinate and the real coordinate respectively corresponding to each positive candidate region, a coordinate loss corresponding to the target image sample, where the coordinate loss is used to represent a difference between the predicted coordinate and the real coordinate.

In an exemplary embodiment, the recognition model training module 604 is configured to:

training a coordinate regression network in the image recognition model based on the coordinate loss corresponding to each image sample, training a category regression network in the image recognition model based on the first classification loss and the second classification loss corresponding to each image sample, and training the remaining networks in the image recognition model based on the coordinate loss, the first classification loss and the second classification loss corresponding to each image sample to obtain the trained image recognition model;

wherein the class regression network comprises respective classifiers of the image recognition model.

It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments, which are not described herein again.

Referring to fig. 8, a schematic structural diagram of a computer device according to an embodiment of the present application is shown. The computer device may be any electronic device with data computing, processing and storing capabilities that can be implemented as the model training device 10 and/or the model using device 20 in the environment of implementation of the embodiment shown in fig. 1. Specifically, the following may be included.

The computer apparatus 800 includes a Central Processing Unit (e.g., a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), etc.) 801, a system Memory 804 including a RAM (Random-Access Memory) 802 and a ROM (Read-Only Memory) 803, and a system bus 805 connecting the system Memory 804 and the Central Processing Unit 801. The computer device 800 also includes a basic Input/Output System (I/O System) 806 for facilitating information transfer between devices within the server, and a mass storage device 807 for storing an operating System 813, application programs 814, and other program modules 815.

In some embodiments, the basic input/output system 806 includes a display 808 for displaying information and an input device 809 such as a mouse, keyboard, etc. for a user to input information. Wherein the display 808 and the input device 809 are connected to the central processing unit 801 through an input output controller 810 connected to the system bus 805. The basic input/output system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 810 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM (Compact disk Read-Only Memory) drive.

Without loss of generality, the computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical, magnetic, or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 804 and mass storage 807 described above may be collectively referred to as memory.

The computer device 800 may also operate as a remote computer connected to a network via a network, such as the internet, in accordance with embodiments of the present application. That is, the computer device 800 may be connected to the network 812 through the network interface unit 811 coupled to the system bus 805, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 811.

The memory also includes a computer program stored in the memory and configured to be executed by the one or more processors to implement the above-described method of training an image recognition model.

In an exemplary embodiment, a computer-readable storage medium is also provided, in which a computer program is stored, which, when being executed by a processor, is adapted to carry out the above-mentioned training method for an image recognition model.

Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random-Access Memory), SSD (Solid State drive), or optical disk. The Random Access Memory may include a ReRAM (resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).

In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises computer instructions, which are stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and executes the computer instructions to cause the computer device to execute the training method of the image recognition model.

It should be noted that the information (including but not limited to the subject equipment information, subject personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals referred to in this application are authorized by the subject or fully authorized by each party, and the collection, use and processing of the relevant data are in compliance with relevant laws and regulations and standards in relevant countries and regions. For example, the image samples, videos, real annotation information, etc. referred to in this application are obtained with sufficient authorization.

It should be understood that reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only show an exemplary possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the illustrated sequence, which is not limited in this application.

The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for training an image recognition model, the method comprising:

determining, by only a target classifier in the target classifier set, a second classification loss corresponding to the target image sample based on each negative candidate region corresponding to the target image sample; wherein, no recognition target corresponding to the image sample is predicted in the negative candidate region;

2. The method of claim 1, wherein determining, by only a target classifier of the set of target classifiers, a second classification loss corresponding to the target image sample based on each negative candidate region corresponding to the target image sample comprises:

3. The method of claim 1, wherein determining a target classifier set corresponding to the target sample subset from the classifiers of the image recognition model comprises:

determining the target classifier set from each classifier of the image recognition model according to the classifier control code;

4. The method of claim 3, wherein the classifier control code comprises a first element and a second element, the first element and the second element are different, and the corresponding elements of the classifier control code correspond to classifiers in the image recognition model in a one-to-one manner;

determining the target classifier set from each classifier of the image recognition model according to the classifier control code, including:

determining a classifier corresponding to the first element as the target classifier, and setting the target classifier to be in a participation state;

determining the classifier corresponding to the second element as a non-target classifier, and setting the non-target classifier to be in a non-participation state; wherein the non-target classifier is a classifier excluding the target classifier from the classifiers.

5. The method of claim 1, wherein determining, by the respective classifiers, a first classification loss for the target image sample based on the respective positive candidate region for the target image sample comprises:

for a first target classifier in a target classifier set corresponding to the target sample subset, determining classification loss of each positive candidate region under the first target classifier based on a predicted target of each positive candidate region under the first target classifier and a real target corresponding to each positive candidate region;

6. The method of claim 1, further comprising:

acquiring a real coordinate set corresponding to the target image sample based on real annotation information corresponding to the target image sample, wherein the real coordinate set comprises real coordinates of each identification target corresponding to the target image sample;

for a target candidate region in a plurality of candidate regions corresponding to the target image sample, determining a prediction coordinate corresponding to the target candidate region based on a candidate feature corresponding to the target candidate region; wherein the candidate region is used for indicating a possible distribution region of the identification target corresponding to the image sample in the image sample;

calculating the prediction coordinates corresponding to the target candidate region with each real coordinate in the real coordinate set respectively to obtain the prediction coordinates corresponding to the target candidate region and the coincidence degree value between each prediction coordinate and each real coordinate respectively; wherein the degree of agreement value is used for representing the degree of agreement between the predicted coordinate and the real coordinate;

if the maximum matching degree value corresponding to the target candidate region is larger than a first threshold value, determining the target candidate region as the positive candidate region;

if the maximum degree of coincidence corresponding to the target candidate region is smaller than a second threshold value, determining the target candidate region as the negative candidate region;

wherein the first threshold is greater than the second threshold.

7. The method of claim 6, further comprising:

determining the real coordinate under the maximum matching degree value corresponding to the positive candidate area as the real coordinate corresponding to the positive candidate area;

determining the prediction coordinates corresponding to the positive candidate regions respectively based on the candidate features corresponding to the positive candidate regions respectively;

and determining the coordinate loss corresponding to the target image sample based on the predicted coordinate and the real coordinate respectively corresponding to each positive candidate region, wherein the coordinate loss is used for representing the difference between the predicted coordinate and the real coordinate.

8. The method according to any one of claims 1 to 7, wherein the training the image recognition model based on the first classification loss and the second classification loss respectively corresponding to each image sample in the image sample set to obtain the trained image recognition model comprises:

9. An apparatus for training an image recognition model, the apparatus comprising:

the classification loss acquisition module is further configured to determine, based on each negative candidate region corresponding to the target image sample, a second classification loss corresponding to the target image sample only by using a target classifier in the target classifier set; wherein, no recognition target corresponding to the image sample is predicted in the negative candidate region;

10. A computer device, characterized in that the computer device comprises a processor and a memory, in which a computer program is stored, which computer program is loaded and executed by the processor to implement the method of training an image recognition model according to any of claims 1 to 8.

11. A computer-readable storage medium, in which a computer program is stored which is loaded and executed by a processor to implement the method of training an image recognition model according to any one of claims 1 to 8.