CN112686344B - Detection model for rapidly filtering background picture and training method thereof - Google Patents
- Publication number: CN112686344B (application CN202110299944.3A)
- Authority: CN (China)
- Prior art keywords: module, training, model, detection, classification
- Prior art date: 2021-03-22
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a detection model for rapidly filtering background pictures, which comprises a backbone network, a classification head module, a feature fusion module, a region proposal network, region-of-interest pooling and a cascade detector, wherein the classification head module is connected to the backbone network. The classification head module and the detection branch share the backbone network: the feature map of the last backbone layer is passed through the classification head module to obtain a classification confidence, and the classification result determines whether the feature map enters the detection module. The model can effectively improve computational efficiency, and has a simple structure and high feasibility.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a detection model for rapidly filtering background pictures and a training method thereof.
Background
In recent years, with the continuous development of artificial intelligence, deep learning has made breakthrough progress in computer vision tasks such as classification, recognition, detection, segmentation and tracking. Compared with traditional machine vision methods, deep convolutional neural networks learn useful features from large amounts of data and offer advantages such as high speed, high precision and low cost. However, a large part of why deep learning outperforms traditional methods is that it relies on large amounts of data, and learning from such data demands substantial computing resources. In target detection scenarios with a large proportion of background (such as intelligent security inspection, industrial inspection and the medical field), most pictures processed by the detection model are background images; for simple backgrounds in particular, most of the detection model's computation is unnecessary, which wastes computing resources.
Disclosure of Invention
In view of the above technical defects, the invention aims to provide a detection model for rapidly filtering background pictures and a training method thereof, which greatly improve detection efficiency while maintaining detection accuracy.
According to one aspect of the invention, a detection model for rapidly filtering background pictures is provided, comprising a backbone network, a classification head module, a feature fusion module, a region proposal network, region-of-interest pooling and a cascade detector.
The backbone network is used for extracting feature information from the input image, and its output end is connected to the classification head module.
The classification head module is used for classifying the feature information extracted by the backbone network to obtain a positive confidence for the picture. The subsequent operation is selected according to the classification result: if the result is positive, the backbone's extracted features are sent into the feature fusion module; if the result is negative, the detection result is output directly, i.e., the picture is background and no target is detected.
The feature fusion module further fuses the features extracted by the backbone network; the fused feature map it produces is connected to the region proposal network. The region proposal network preliminarily filters candidate regions to obtain regions of interest, and the region-of-interest pooling layer fixes the features of the obtained regions of interest to the same size.
The cascade detector further classifies and regresses the regions of interest from the previous step to output the final detection result.
The classification head module is structured as follows: a 3 × 3 convolution layer, then a 1 × 1 convolution layer for dimensionality reduction, then adaptive global pooling so that input images of different sizes yield features of the same dimension, and finally a fully connected output layer. The activation function of the output layer is sigmoid, the number of output neurons equals the number of target classes, and the classification loss function is cross-entropy.
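A minimal PyTorch sketch of such a classification head follows, assuming an input feature map with `in_channels` channels and a reduced width `mid_channels`; these parameter names are illustrative, not from the patent:

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Classification head: 3x3 conv -> 1x1 conv (dimensionality reduction)
    -> adaptive global pooling -> fully connected layer -> sigmoid."""
    def __init__(self, in_channels: int, mid_channels: int, num_classes: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.conv1x1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1)  # reduce dimension
        self.pool = nn.AdaptiveAvgPool2d(1)  # same feature size for any input resolution
        self.fc = nn.Linear(mid_channels, num_classes)  # one output neuron per target class

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.conv1x1(self.conv3x3(feat))
        x = self.pool(x).flatten(1)
        return torch.sigmoid(self.fc(x))  # per-class positive confidences in (0, 1)

# Cross-entropy over the sigmoid outputs, i.e. binary cross-entropy per neuron:
criterion = nn.BCELoss()
```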
Further preferably, the label encoding during classification head training is as follows: the number of neurons output by the classification head is set equal to the number of classes in the detection task; if a target of a given class exists in the picture, the label of the corresponding neuron is 1, and if not, it is 0.
According to another technical scheme of the invention, a model training method is provided for training the above detection model. The backbone network is set as module A, the classification head module as module B, and the feature fusion module, region proposal network and region-of-interest pooling are combined into module C. The following alternating training scheme is adopted during model training (a code sketch follows the enumerated steps below):
(a) training module A and module B as a whole for m₁ epochs;
(b) freezing module A, fine-tuning module C for n₁ epochs;
(c) training module A and module C as a whole for m₂ epochs;
(d) freezing modules A and C, fine-tuning module B for n₂ epochs;
The training process consists of the model learning the training set multiple times, each pass being called an epoch; m₁ and m₂ are the per-period numbers of training epochs set for model training; n₁ and n₂ are the numbers of epochs set for fine-tuning the model; m₁, m₂, n₁ and n₂ are all integers greater than or equal to 1.
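A minimal sketch of this alternating schedule in PyTorch, assuming `backbone` (module A), `cls_head` (module B) and `det_modules` (module C) are `nn.Module` instances and that `train_epochs(params, num_epochs)` stands in for an ordinary training loop over the training set; all of these names are illustrative assumptions:

```python
import itertools
import torch.nn as nn

def set_frozen(module: nn.Module, frozen: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = not frozen

def alternating_train(backbone, cls_head, det_modules, train_epochs,
                      m1: int, n1: int, m2: int, n2: int) -> None:
    def phase(trainable, frozen, num_epochs):
        for m in frozen:
            set_frozen(m, True)
        for m in trainable:
            set_frozen(m, False)
        params = itertools.chain(*(m.parameters() for m in trainable))
        train_epochs(params, num_epochs)

    phase([backbone, cls_head], [det_modules], m1)   # (a) A + B jointly
    phase([det_modules], [backbone, cls_head], n1)   # (b) freeze A, fine-tune C
    phase([backbone, det_modules], [cls_head], m2)   # (c) A + C jointly
    phase([cls_head], [backbone, det_modules], n2)   # (d) freeze A and C, fine-tune B
```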
According to the invention, the classification head module is added after the backbone network of the detection model and shares the backbone with the detection branch. The classification confidence is obtained by passing the feature map of the last backbone layer through the classification head module, and the classification result determines whether the feature map enters the detection module, which effectively improves computational efficiency. In use, binary classification of foreground and background is performed directly with the classification head, achieving a good classification effect; since the backbone of the trained detection model is reused, a good effect is reached merely by fine-tuning the classification head, making the scheme simple in structure and highly feasible. The distinctive label encoding also provides more supervision information, and the sigmoid activation function keeps the output neurons from influencing each other, which benefits the learning of the classification module. Furthermore, if the loss functions of the detection network and the classification head were combined into a joint loss, training would be unstable and it would be difficult for all modules to reach their optimum simultaneously; the alternating training scheme allows both the classification head and the detection modules to be optimized.
Drawings
Fig. 1 is a schematic structural diagram of a detection model for rapidly filtering a background picture according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a structure of a classification head in a detection model for rapidly filtering a background picture according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions in this specification better understood, the technical solutions in one or more embodiments of this specification are described below clearly and completely. Obviously, the described embodiments are only a part of the embodiments of this specification, not all of them.
All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without inventive effort shall fall within the scope of protection of this specification.
Example 1: in order to solve the above technical problems, the present embodiment takes a detection model of a contraband detection scene as an example, and explains a detection model for filtering a background picture and a training method thereof.
Fig. 1 is a schematic structural diagram of the detection model for rapidly filtering background pictures according to Example 1 of the invention. The detection model is an improvement of the Cascade RCNN detection model and mainly comprises a backbone network (Backbone), a classification head module (Cls Head), a feature fusion module (FPN), a region proposal network (RPN), region-of-interest pooling (RoI), and a cascade detector (Cascade Head).

The backbone network extracts feature information from the input image, and its output end is connected to the classification head module. The classification head module classifies the features extracted by the backbone to obtain a positive confidence for the picture, where a picture containing a target is defined as positive and a picture without a target as negative. The subsequent operation is selected according to the classification result: if the picture is positive, the backbone's extracted features are sent into the feature fusion module; if negative, the detection result is output directly, i.e., the picture is negative and no target is detected. Usually, the result for a negative picture with no detected target is not additionally displayed.

The positive confidence of a picture comprises multiple confidence values: the number of neurons output by the classification head module is set equal to the number of classes in the detection task, so the number of results equals the number of classes, and each result represents the confidence that a target of that class may exist. If any confidence is greater than or equal to the threshold, the picture is positive; otherwise it is negative. The threshold is set by plotting a classification precision-recall (PR) curve in combination with scene requirements. For example, with a threshold of 0.5, the picture is judged positive if any confidence reaches 0.5 and negative otherwise.

The feature fusion module further fuses the features extracted by the backbone network; the fused feature map it produces is connected to the region proposal network. The region proposal network preliminarily filters candidate regions to obtain regions of interest, and the region-of-interest pooling layer fixes the features of these regions to the same size. Finally, the cascade detector further classifies and regresses the regions of interest to output the final detection result.
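The gated forward pass can be sketched as follows in PyTorch-style code, under the assumption that `backbone`, `cls_head`, `fpn`, `rpn`, `roi_pool` and `cascade_head` are available as modules; the module names and the empty-list convention for a background result are illustrative, not prescribed by the patent:

```python
import torch
import torch.nn as nn

class FilteringDetector(nn.Module):
    """Cascade RCNN variant: a classification head on the last backbone
    feature map decides whether the detection branch runs at all."""
    def __init__(self, backbone, cls_head, fpn, rpn, roi_pool, cascade_head,
                 threshold: float = 0.5):
        super().__init__()
        self.backbone, self.cls_head = backbone, cls_head
        self.fpn, self.rpn = fpn, rpn
        self.roi_pool, self.cascade_head = roi_pool, cascade_head
        self.threshold = threshold  # chosen from the classification PR curve

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)            # multi-level features (assumed list)
        conf = self.cls_head(feats[-1])         # per-class positive confidences
        if (conf < self.threshold).all():       # background picture: skip detection
            return []                           # empty result, no target detected
        fused = self.fpn(feats)                 # feature fusion
        proposals = self.rpn(fused)             # preliminary candidate filtering
        rois = self.roi_pool(fused, proposals)  # fixed-size region-of-interest features
        return self.cascade_head(rois)          # final classification + regression
```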
The type of the model is not limited in this embodiment; it may be a one-stage or two-stage deep learning model.
Preferably, referring to fig. 2, the classification head module is structured as follows: a 3 × 3 convolution layer, then a 1 × 1 convolution layer for dimensionality reduction, then adaptive global pooling so that input images of different sizes yield features of the same dimension, and finally a fully connected output layer. The activation function of the output layer is sigmoid, the number of output neurons equals the number of target classes, and the classification loss function is cross-entropy.
Further preferably, the label encoding during classification head training is as follows: the number of neurons output by the classification head is set equal to the number of classes in the detection task; if a target of a given class exists in the picture, the label of the corresponding neuron is 1, otherwise it is 0. As an example, in a contraband detection scenario in which one of the classes is "knife": if a knife is present in the detected picture, the neuron corresponding to "knife" is labeled 1; if no knife is present, the corresponding neuron is labeled 0.
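A small illustration of this multi-hot label encoding, assuming three contraband classes; the classes beyond "knife" are illustrative assumptions:

```python
import torch

CLASSES = ["knife", "gun", "lighter"]  # "gun" and "lighter" are illustrative

def encode_labels(present: set) -> torch.Tensor:
    """One neuron per class: label 1 if a target of that class is in the picture."""
    return torch.tensor([1.0 if c in present else 0.0 for c in CLASSES])

print(encode_labels({"knife"}))  # tensor([1., 0., 0.])
print(encode_labels(set()))      # background picture -> tensor([0., 0., 0.])
```

Because each neuron has its own sigmoid output, the classes are supervised independently, which matches this encoding.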
In the prior art, a classification network is connected in series in front of the detection model, and only pictures that pass the classification network are detected. In this embodiment of the invention, the classification head module is instead added after the backbone network of the detection model and shares the backbone with the detection branch; the classification confidence is obtained by passing the last-layer feature map of the backbone through the classification head, and the classification result determines whether the feature map enters the detection module, which effectively improves computational efficiency. Binary classification of foreground and background is performed directly with this classification head, achieving a good classification effect; since the backbone of the trained detection model is reused, a good effect is reached merely by fine-tuning the classification head, making the approach simpler and more feasible.
Example 2: according to another technical solution of an embodiment of the present invention, a model training method is provided for training the detection model described in embodiment 1, where a backbone network is set as a module a, a classification head module is set as a module B, the model structure residual module (including a combination of FPN, RPN, Roi posing, and Cascade head) described in embodiment 1 is set as a module C, which may also be referred to as a detection module, in an embodiment, a training set is established to train the model, and an alternate training mode is adopted during model training:
(a) first, train module A and module B as a whole for 12 epochs;
(b) freeze module A, and fine-tune module C for 3 epochs;
(c) train module A and module C as a whole for 12 epochs;
(d) freeze module A and module C, and fine-tune module B for 3 epochs.
The training process consists of the model learning the training set multiple times, each pass being called an epoch. Here 12 is the manually chosen number of epochs for one training period, a common choice for one iteration cycle; 3 is the manually chosen number of epochs for fine-tuning, and about 3 epochs is typical (i.e., m₁ = m₂ = 12 and n₁ = n₂ = 3 in the notation of the general method).
After these four steps, the final model is obtained. Experiments verified that the pre-trained model obtained through steps (a) and (b) is superior to an ImageNet pre-trained model and transfers well to data of the same type (such as X-ray pictures); that is, after the data is replaced, using it as the pre-trained model gives a better effect than pre-training on ImageNet.
After training of the model is completed, the test method is as follows: features are extracted from the picture by the backbone network and passed through the classification head, which outputs as many results as there are target classes. If all results are below the threshold, the picture is judged to be background and an empty detection result is returned directly; otherwise the picture is processed by the detection model. The actual threshold is determined by plotting the classification PR curve and taking scene requirements into account.
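A sketch of this threshold selection from a PR curve, assuming picture-level validation labels (1 if the picture contains any target, 0 if background) and classification-head scores are available as arrays; the recall requirement of 0.99 and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: 1 if the validation picture contains any target, 0 if background
# y_score: maximum per-class confidence from the classification head
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # illustrative data
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Scene requirement example: keep recall above 0.99 so that almost no
# picture containing a target is filtered out as background.
ok = recall[:-1] >= 0.99
threshold = thresholds[ok][-1] if ok.any() else 0.5   # fall back to 0.5
print(f"chosen threshold: {threshold:.2f}")
```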
The technical scheme of the invention can also be applied to target recognition and detection scenes other than the contraband detection of this embodiment, such as face recognition, license plate recognition, road recognition, autonomous driving, and lesion detection and analysis in medical CT examination scenes.
The preferred embodiments of the present application disclosed above are intended only to aid in explaining the application. They neither describe every detail exhaustively nor limit the invention to the specific implementations described. Obviously, many modifications and variations are possible in light of this specification. These embodiments were chosen and described in order to better explain the principles and practical application of the invention, so that those skilled in the art can understand and use it well. The application is limited only by the claims, together with their full scope and equivalents.
Claims (8)
1. A detection model for rapidly filtering background pictures, characterized in that:
the system comprises a backbone network, a classification head module and a detection module; the main network is used for extracting characteristic information from the input image, and the output end of the main network is connected with the classification head module; the classification head module is used for classifying the feature information extracted by the main network so as to obtain whether an input image contains a target or not, if so, sending the feature extraction information of the main network into the detection module, and if not, directly outputting a detection result;
pictures containing targets are set as positive and pictures containing no targets as negative; the number of neurons output by the classification head is set equal to the number of classes in the detection task, so the number of results equals the number of classes, and each result represents the confidence that a target of that class may exist; if any confidence is greater than or equal to the threshold, the picture is positive, otherwise it is negative, wherein the threshold is set by plotting a classification PR curve in combination with scene requirements;
the model is obtained by the following training method:
(a) training the backbone network and the classification head module as a whole for m₁ epochs;
(b) fine-tuning the detection module for n₁ epochs;
(c) training the backbone network and the detection module as a whole for m₂ epochs;
(d) fine-tuning the classification head module for n₂ epochs;
wherein m₁ and m₂ are the per-period numbers of training epochs set for model training; n₁ and n₂ are the numbers of epochs set for fine-tuning the model; and m₁, m₂, n₁ and n₂ are all integers greater than or equal to 1.
2. The detection model of claim 1, wherein the detection module comprises a feature fusion module, a region proposal network, region-of-interest pooling, and a cascade detector.
3. The detection model for rapidly filtering background pictures of claim 2, wherein the feature fusion module is configured to further fuse the features extracted by the backbone network, and the fused feature map obtained after the feature fusion module is connected to the region proposal network.
4. The detection model of claim 2, wherein the region proposal network is configured to preliminarily filter candidate regions to obtain regions of interest, and the region-of-interest pooling layer fixes the features of the obtained regions of interest to the same size.
5. The detection model as claimed in claim 2, wherein the cascade detector further classifies and regresses the region of interest to output a final detection result.
6. The detection model of claim 1, wherein the classification head module comprises a 3 × 3 convolution layer, a 1 × 1 convolution layer for dimensionality reduction, adaptive global pooling so that input images of different sizes obtain features of the same dimension, and a fully connected output layer.
7. The detection model of claim 1, wherein the label encoding during classification head module training is as follows: the number of neurons output by the classification head is set equal to the number of classes in the detection task; if a target of a given class exists in the input picture, the label of the corresponding neuron is 1, otherwise it is 0.
8. A model training method for training the detection model of any one of claims 1 to 7, wherein the backbone network is set as module A, the classification head module as module B, and the feature fusion module, region proposal network and region-of-interest pooling as module C, and the following alternating training scheme is adopted during model training:
(a) training module A and module B as a whole for m₁ epochs;
(b) fine-tuning module C for n₁ epochs;
(c) training module A and module C as a whole for m₂ epochs;
(d) fine-tuning module B for n₂ epochs;
wherein m₁ and m₂ are the per-period numbers of training epochs set for model training; n₁ and n₂ are the numbers of epochs set for fine-tuning the model; and m₁, m₂, n₁ and n₂ are all integers greater than or equal to 1.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110299944.3A | 2021-03-22 | 2021-03-22 | Detection model for rapidly filtering background picture and training method thereof |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110299944.3A | 2021-03-22 | 2021-03-22 | Detection model for rapidly filtering background picture and training method thereof |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112686344A | 2021-04-20 |
| CN112686344B | 2021-07-02 |
Family

Family ID: 75455764

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110299944.3A | Detection model for rapidly filtering background picture and training method thereof | 2021-03-22 | 2021-03-22 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN112686344B (en) |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10679351B2 * | 2017-08-18 | 2020-06-09 | Samsung Electronics Co., Ltd. | System and method for semantic segmentation of images |
| CN112348787A * | 2020-11-03 | 2021-02-09 | 中科创达软件股份有限公司 | Training method of object defect detection model, object defect detection method and device |
Family Cites Families (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107609601B * | 2017-09-28 | 2021-01-22 | 北京计算机技术及应用研究所 | Ship target identification method based on multilayer convolutional neural network |
| CN111814755A * | 2020-08-18 | 2020-10-23 | 深延科技(北京)有限公司 | Multi-frame image pedestrian detection method and device for night motion scene |
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN112686344A | 2021-04-20 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |