
CN116342955A - Target detection method and system based on improved feature pyramid network - Google Patents

Target detection method and system based on improved feature pyramid network

Info

Publication number
CN116342955A
CN116342955A
Authority
CN
China
Prior art keywords
feature
detection
target
target image
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310324705.8A
Other languages
Chinese (zh)
Inventor
李天平
韩宇
丁同贺
李冠兴
崔朝童
李萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202310324705.8A priority Critical patent/CN116342955A/en
Publication of CN116342955A publication Critical patent/CN116342955A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/84 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method and system based on an improved feature pyramid network. The method comprises the following steps: acquiring a target image to be detected and identified; and inputting the target image into a target recognition detection model to obtain a target detection result. The target recognition detection model comprises a feature extraction part, a fusion part and a detection head part. The fusion part fuses the extracted high-level feature maps, which carry strong semantic information, with the low-level feature maps, which carry rich position information, so that more refined features can be extracted and the detection of small objects is facilitated.

Description

Target detection method and system based on improved feature pyramid network
Technical Field
The invention belongs to the technical field of target identification, and particularly relates to a target detection method and system based on an improved feature pyramid network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
To improve the accuracy of single-stage object detectors, most object detectors adopt a Feature Pyramid Network (FPN) structure to optimize features at different levels. The feature pyramid network assigns objects of different sizes to different feature layers for detection. As the network goes deeper, the semantic information of the feature maps becomes richer, while the shallow features carry richer position information. To provide the lower-layer features with more semantic information, the feature pyramid network introduces a top-down structure that propagates feature maps from high layers to low layers, enhancing the semantic information of the shallow feature maps. The feature pyramid network thus alleviates the difficulty of multi-scale detection and yields better object features.
Object detection is an important but challenging task in computer vision that requires classifying and localizing objects in digital images. It forms the basis of many visual applications, including instance segmentation, object tracking, and autonomous driving. All CNN-based object detectors can be divided into two categories: anchor-based and anchor-free detectors. The former comprises two-stage and single-stage methods. Compared with anchor-based methods, anchor-free methods abandon the anchor mechanism, eliminating many anchor-box hyperparameters such as size, aspect ratio, and number, while improving detector performance.
The fully convolutional one-stage detector (FCOS) is an excellent anchor-free detector that treats all pixels inside an image as training samples and directly predicts the four distances of the bounding box. It introduces a novel "center-ness" branch to suppress low-quality detection bounding boxes. FCOS consists of three parts: (1) a backbone network that extracts features from the input image; (2) a detection head that predicts the classification and localization of objects; (3) a neck that gathers features from different stages.
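The per-pixel regression scheme of FCOS can be illustrated with a short sketch (a simplified illustration following the FCOS paper, not the patent's code): each location (x, y) predicts four distances (l, t, r, b) to the box sides, and the center-ness target is the square root of (min(l,r)/max(l,r)) · (min(t,b)/max(t,b)).

```python
import math

def decode_box(x, y, l, t, r, b):
    """Recover (x1, y1, x2, y2) from a location and its four predicted distances."""
    return (x - l, y - t, x + r, y + b)

def centerness(l, t, r, b):
    """Center-ness target from the FCOS paper: 1.0 at the box centre, near 0 at edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

box = decode_box(100, 100, l=10, t=20, r=30, b=40)   # (90, 80, 130, 140)
c_center = centerness(15, 15, 15, 15)                # 1.0 for a centred location
c_edge = centerness(1, 15, 29, 15)                   # small for an off-centre location
```

Low-quality boxes predicted far from an object's centre thus receive a small center-ness value and can be suppressed.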
The structure of the feature pyramid network is not optimal and has two drawbacks. 1) A 1×1 convolution is used to reduce the number of channels in the feature maps. The channel numbers of feature maps C3, C4, and C5 are 512, 1024, and 2048, respectively, but when passed to the neck structure they are all reduced to 256 by 1×1 convolutions, so the features on these layers lose much useful information. 2) The top-down structure of the feature pyramid network emphasizes features of adjacent layers, which obstructs the propagation of features from higher layers to lower layers. Therefore, a detector based on a feature pyramid network cannot obtain optimal results for small objects using low-level features.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a target detection method and system based on an improved feature pyramid network, in which a fusion part fuses the extracted high-level feature maps with strong semantic information with the low-level feature maps with rich position information, so that more refined features can be extracted and the detection of small objects is facilitated.
To achieve the above object, a first aspect of the present invention provides an object detection method based on an improved feature pyramid network, including the steps of:
acquiring a target image to be detected and identified;
inputting the target image into a target recognition detection model to obtain a target detection result;
the target recognition detection model comprises a feature extraction part, a fusion part and a detection head part;
the feature extraction part performs feature extraction on the target image to obtain feature images with different scales;
the fusion part stacks the feature graphs with different scales to obtain a feature pyramid;
and the detection head part obtains a detection result of the target image to be detected and identified according to the feature pyramid.
A second aspect of the invention provides an improved feature pyramid network-based object detection system, comprising:
a target image acquisition module: the method comprises the steps of acquiring a target image to be detected and identified;
the target image detection module inputs the target image into a target recognition detection model to obtain a target detection result;
the identification detection model building module: the target recognition detection model comprises a feature extraction part, a fusion part and a detection head part;
the feature extraction part performs feature extraction on the target image to obtain feature images with different scales;
the fusion part stacks the feature graphs with different scales to obtain a feature pyramid;
and the detection head part obtains a detection result of the target image to be detected and identified according to the feature pyramid.
A third aspect of the invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method described above.
A fourth aspect of the invention provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method described above.
The one or more of the above technical solutions have the following beneficial effects:
In the invention, the fusion part stacks the features extracted by the feature extraction part and, following the multi-scale fusion idea of the feature pyramid network, fuses the high-level feature maps with strong semantic information with the low-level feature maps with rich position information; this not only extracts more refined features but is also more beneficial to the detection of small objects.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of a network structure of a target recognition detection model according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a target detection process according to the present invention;
FIG. 3 is a schematic diagram of a fusion module according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of an inverted residual module structure according to a first embodiment of the present invention;
FIG. 5 is a schematic diagram of a lightweight detection head according to the first embodiment of the invention;
FIG. 6 shows the detection results of the target recognition detection model on the COCO2017 test set in the first embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example one
The embodiment discloses a target detection method based on an improved feature pyramid network, which comprises the following steps:
acquiring a target image to be detected and identified;
inputting the target image into a target recognition detection model to obtain a target detection result;
the target recognition detection model comprises a feature extraction part, a fusion part and a detection head part;
the feature extraction part performs feature extraction on the target image to obtain feature images with different scales;
the fusion part stacks the feature graphs with different scales to obtain a feature pyramid;
and the detection head part obtains a detection result of the target image to be detected and identified according to the feature pyramid.
As shown in fig. 1, in the present embodiment, an anchor-free detection method using FCOS as a baseline is proposed for detecting an object, and an object recognition detection model is established including a feature extraction section, a fusion section, and a detection head section.
Specifically, the feature extraction part uses ResNet-50 as the backbone network to obtain feature maps {C3, C4, C5} of different scales; the feature maps {N3, N4, N5} of corresponding sizes are then obtained by applying a 3×3 convolution to each of them. The 3×3 convolution kernel has a larger receptive field and can extract more refined features; its main role is to better extract features without changing the feature map size.
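The claim that a 3×3 convolution leaves the spatial size unchanged holds when stride 1 and padding 1 are used; this follows from the standard output-size formula, sketched below for illustration (the stride/padding values are the usual convention, which the patent does not state explicitly):

```python
def conv_out_size(n, k, p, s):
    """Standard convolution output size: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

same = conv_out_size(64, k=3, p=1, s=1)   # 64: 3x3, stride 1, padding 1 preserves size
half = conv_out_size(64, k=3, p=1, s=2)   # 32: the same kernel with stride 2 halves it
```

The stride-2 case corresponds to the "3×3 convolution downsampling" used later to produce P6 and P7.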
As shown in fig. 3, the feature maps {C3, C4, C5} of different scales are first extracted by the backbone network, {N3, N4, N5} are then obtained by 3×3 convolution, and N3 and N4 are scaled to the same size as N5 by a downsampling operation. Finally, the three feature maps are fused through a concat operation to obtain the fusion feature. The fusion feature exploits the multi-scale fusion idea of the feature pyramid network: by fusing the high-level feature maps with strong semantic information with the low-level feature maps with rich position information, finer features can be extracted and the detection of small objects is facilitated.
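The shape bookkeeping of this fusion module can be sketched with NumPy. The channel count (256) and spatial sizes (strides 8/16/32 of a 640×640 input) are illustrative assumptions, and the downsampling is simulated by strided slicing rather than a learned operation:

```python
import numpy as np

# Assumed shapes for N3, N4, N5: 256 channels at strides 8, 16, 32 of a 640x640 input.
N3 = np.zeros((256, 80, 80))
N4 = np.zeros((256, 40, 40))
N5 = np.zeros((256, 20, 20))

# Scale N3 and N4 down to N5's spatial size (stand-in for the downsampling op).
N3_small = N3[:, ::4, ::4]   # 80 -> 20
N4_small = N4[:, ::2, ::2]   # 40 -> 20

# concat along the channel axis: 256 + 256 + 256 = 768 channels at 20x20.
fused = np.concatenate([N3_small, N4_small, N5], axis=0)
print(fused.shape)   # (768, 20, 20)
```

The concat keeps all channels from all three levels, in contrast to the 1×1 channel reduction criticized in the background section.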
A concat operation is then performed between the fusion feature and each of {C3, C4, C5} after 1×1 convolution, and the results serve as the input of the FPN; features are extracted through the inverted residual modules in the FPN to generate the P3-P5 feature maps. To obtain better semantic information for detecting large targets, the fusion feature is downsampled by a 3×3 convolution to obtain P6, and P6 is downsampled by another 3×3 convolution to obtain P7. Finally, P3-P7 are input into the detection head as the feature maps for detection.
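For a 512×512 input (the resolution used in Table 1), each downsampling step halves the spatial size, giving the usual five-level pyramid. The strides 8 through 128 are an assumption based on the standard FCOS/FPN convention, not stated in the patent:

```python
def pyramid_sizes(input_size, strides=(8, 16, 32, 64, 128)):
    """Spatial size of each pyramid level P3..P7 for a square input."""
    return {f"P{i + 3}": input_size // s for i, s in enumerate(strides)}

print(pyramid_sizes(512))   # {'P3': 64, 'P4': 32, 'P5': 16, 'P6': 8, 'P7': 4}
```

Small objects are detected on the large P3 map, large objects on the coarse P6/P7 maps.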
As shown in fig. 4, this embodiment adopts an improved feature pyramid network in which the 3×3 convolutions inside the original feature pyramid network (FPN) are replaced by inverted residual modules. In contrast to the residual convolution, which first reduces and then increases the dimension, the inverted residual module first uses a 1×1 convolution to increase the dimension, then uses a 3×3 depthwise (DW) convolution for feature extraction, and finally uses a 1×1 convolution to reduce the dimension. The inverted residual module thus ensures that the DW convolution can extract more features in the high-dimensional space while greatly reducing the computational cost of the model. The formula is:
P_out = ReLU6(PW2(DW(ReLU6(PW1(P_in))))) + P_in (1)
where ReLU6(x) = min(max(x, 0), 6), P_in represents the input of the inverted residual module, PW1 represents the 1×1 convolution that increases the dimension, PW2 represents the 1×1 convolution that reduces the dimension, DW represents the depthwise convolution, and P_out represents the output of the inverted residual module.
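Equation (1) can be checked numerically with a toy sketch: ReLU6 is implemented directly, and the channel dimensions of PW1/DW/PW2 are tracked with an assumed expansion factor of 4 (the expansion factor and the 256-channel input are assumptions for illustration; the patent does not state them).

```python
import numpy as np

def relu6(x):
    """ReLU6(x) = min(max(x, 0), 6)."""
    return np.minimum(np.maximum(x, 0.0), 6.0)

# Channel bookkeeping of the inverted residual module for an assumed 256-channel input.
c_in = 256
expansion = 4                 # assumed expansion factor
c_hidden = c_in * expansion   # PW1: 1x1 conv raises 256 -> 1024
c_dw = c_hidden               # DW:  depthwise 3x3 conv keeps 1024 channels
c_out = c_in                  # PW2: 1x1 conv lowers 1024 -> 256 so P_in can be added

x = np.array([-3.0, 0.0, 2.5, 6.0, 9.0])
print(relu6(x))               # [0.  0.  2.5 6.  6. ]
```

Because PW2 restores the original channel count, the skip connection `+ P_in` is well defined without any extra projection.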
In this embodiment, a lightweight detection head is also designed, so as to reduce complexity of the model and maintain balance between accuracy and speed.
As shown in fig. 5, the detection head consists of one inverted residual convolution followed by two branches of dilated convolutions. The feature map output by the FPN first undergoes feature extraction by the inverted residual convolution; features are then further extracted by two serially connected 3×3 dilated convolutions (dilation rate = 2). The dilated convolutions enlarge the receptive field over the spatial image, so a large range of context information can be encoded well at different scales. Finally, three predictions are output: the Classification prediction, the Center-ness prediction, and the Regression prediction.
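The receptive-field gain from the two dilated convolutions can be worked out with the standard stride-1 formula RF = 1 + Σ (k − 1)·d (a sketch for illustration; it considers only these two layers and ignores the preceding backbone and inverted residual convolution):

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions; layers = [(kernel, dilation), ...]."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

plain = receptive_field([(3, 1), (3, 1)])     # 5: two ordinary 3x3 convolutions
dilated = receptive_field([(3, 2), (3, 2)])   # 9: two 3x3 convolutions with dilation 2
```

With the same parameter count, dilation 2 nearly doubles the context each head position can see.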
For the Classification branch, 80 score parameters are predicted at each location in the prediction feature map (the COCO dataset has 80 classes).
For the Regression and Center-ness branches, 4 distance parameters and 1 parameter are predicted at each position of the prediction feature map, respectively. When the network subsequently screens for high-quality bounding boxes, the square root of the product of the predicted class score and the Center-ness is taken, and the boxes are ranked by this result. Only the boxes with the higher scores are retained; the purpose is to filter out boxes with a low target class score or with predicted points far from the target center, so that finally only the high-quality bounding boxes remain.
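The screening step described above, ranking boxes by the square root of class score times Center-ness, can be sketched as follows (the candidate values and labels are made up for illustration):

```python
import math

# Each candidate: (class score, center-ness, label).
candidates = [
    (0.90, 0.10, "high score, far from centre"),
    (0.60, 0.80, "moderate score, well centred"),
    (0.20, 0.90, "low score, well centred"),
]

# Rank by sqrt(cls_score * centerness), highest first.
ranked = sorted(candidates, key=lambda c: math.sqrt(c[0] * c[1]), reverse=True)
best = ranked[0][2]   # a well-centred box outranks a high-scoring off-centre one
```

The geometric mean penalizes boxes that are strong on only one of the two criteria, which is exactly what pushes off-centre detections down the ranking.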
The target recognition detection model proposed in this embodiment is trained and evaluated on the COCO2017 dataset and compared with other methods to demonstrate its superiority, with AP taken as the evaluation index of target detection accuracy. The test results are shown in Table 1; the model of this embodiment achieves the best detection performance. Fig. 6 shows the detection results of the model of this embodiment on the COCO2017 test set.
Table 1:
Method           Backbone network  Input resolution  Params(M)  AP(%)
Fast R-CNN       VGG-16            600×1000          -          19.3
Faster R-CNN     VGG-16            600×1000          134.7      21.9
RetinaNet        MobileNetV2       800×1024          11.3       30.8
SSD300           VGG-16            300×300           26.3       23.2
SSD512           VGG-16            512×512           29.4       26.8
YOLOv2           DarkNet-19        544×544           51.0       21.2
EfficientDet-D0  EfficientNet-B0   512×512           -          33.8
YOLOv3           DarkNet-53        416×416           -          33.0
Ours             ResNet-50         512×512           36.4       35.6
Example two
It is an object of this embodiment to provide an improved feature pyramid network based object detection system comprising:
a target image acquisition module: the method comprises the steps of acquiring a target image to be detected and identified;
the target image detection module inputs the target image into a target recognition detection model to obtain a target detection result;
the identification detection model building module: the target recognition detection model comprises a feature extraction part, a fusion part and a detection head part;
the feature extraction part performs feature extraction on the target image to obtain feature images with different scales;
the fusion part stacks the feature graphs with different scales to obtain a feature pyramid;
and the detection head part obtains a detection result of the target image to be detected and identified according to the feature pyramid.
Example three
It is an object of the present embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, which processor implements the steps of the method described above when executing the program.
Example four
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
The steps involved in the devices of the second, third and fourth embodiments correspond to those of the first embodiment of the method, and the detailed description of the embodiments can be found in the related description section of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media including one or more sets of instructions; it should also be understood to include any medium capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one of the methods of the present invention.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computer means, alternatively they may be implemented by program code executable by computing means, whereby they may be stored in storage means for execution by computing means, or they may be made into individual integrated circuit modules separately, or a plurality of modules or steps in them may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. An improved feature pyramid network-based target detection method is characterized by comprising the following steps:
acquiring a target image to be detected and identified;
inputting the target image into a target recognition detection model to obtain a target detection result;
the target recognition detection model comprises a feature extraction part, a fusion part and a detection head part;
the feature extraction part performs feature extraction on the target image to obtain feature images with different scales;
the fusion part stacks the feature graphs with different scales to obtain a feature pyramid;
and the detection head part obtains a detection result of the target image to be detected and identified according to the feature pyramid.
2. The target detection method based on the improved feature pyramid network as claimed in claim 1, wherein the feature extraction part adopts a ResNet-50 backbone network; the ResNet-50 backbone network performs feature extraction on the target image to be detected to obtain feature maps {C3, C4, C5} of different scales of the target image, and the obtained feature maps of different sizes are respectively subjected to a 3×3 convolution operation to obtain feature maps {N3, N4, N5} of corresponding sizes.
3. The target detection method based on the improved feature pyramid network as claimed in claim 2, wherein the fusion part scales the feature maps N3 and N4 to the same size as the feature map N5 through a downsampling operation, and fuses the scaled feature maps N3 and N4 with the feature map N5 through a concat operation.
4. The target detection method based on the improved feature pyramid network as claimed in claim 2, wherein the outputs of the fusion part are subjected to a concat operation with {C3, C4, C5} convolved by 1×1, respectively, and then used as the input of the improved feature pyramid network; the improved feature pyramid network is specifically: the 3×3 convolutions in the original feature pyramid network are replaced with inverted residual modules, and each inverted residual module performs a dimension-increasing operation on its input using a 1×1 convolution, performs feature extraction using a 3×3 DW convolution, and performs a dimension-reducing operation using a 1×1 convolution.
5. The target detection method based on the improved feature pyramid network as claimed in claim 4, wherein the output of the fusion part is subjected to a 3×3 convolution downsampling operation to obtain a feature map P6, the feature map P6 is subjected to a 3×3 convolution downsampling operation to obtain a feature map P7, and the feature maps P6 and P7 together with the outputs of the inverted residual modules are used as inputs of the detection head part for detection.
6. The target detection method based on the improved feature pyramid network as claimed in claim 4, wherein the detection head part consists of an inverted residual convolution and two branches of dilated convolutions; the output of the inverted residual module undergoes feature extraction by the inverted residual convolution and is then further processed by two serially connected 3×3 dilated convolutions to output the predicted values.
7. A target detection system based on an improved feature pyramid network, characterized by comprising:
a target image acquisition module: the method comprises the steps of acquiring a target image to be detected and identified;
the target image detection module inputs the target image into a target recognition detection model to obtain a target detection result;
the identification detection model building module: the target recognition detection model comprises a feature extraction part, a fusion part and a detection head part;
the feature extraction part performs feature extraction on the target image to obtain feature images with different scales;
the fusion part stacks the feature graphs with different scales to obtain a feature pyramid;
and the detection head part obtains a detection result of the target image to be detected and identified according to the feature pyramid.
8. The target detection system based on the improved feature pyramid network as claimed in claim 7, wherein in the recognition detection model building module, the feature extraction part adopts a ResNet-50 backbone network; the ResNet-50 backbone network performs feature extraction on the target image to be detected to obtain feature maps {C3, C4, C5} of different scales of the target image, and the obtained feature maps of different sizes are respectively subjected to a 3×3 convolution operation to obtain feature maps {N3, N4, N5} of corresponding sizes.
9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of a method for object detection based on an improved feature pyramid network as claimed in any one of claims 1-6.
10. A processing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of a method for object detection based on an improved feature pyramid network as claimed in any one of claims 1-6 when said program is executed.
CN202310324705.8A 2023-03-27 2023-03-27 Target detection method and system based on improved feature pyramid network Pending CN116342955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310324705.8A CN116342955A (en) 2023-03-27 2023-03-27 Target detection method and system based on improved feature pyramid network


Publications (1)

Publication Number Publication Date
CN116342955A true CN116342955A (en) 2023-06-27

Family

ID=86885457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310324705.8A Pending CN116342955A (en) 2023-03-27 2023-03-27 Target detection method and system based on improved feature pyramid network

Country Status (1)

Country Link
CN (1) CN116342955A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118587622A * 2024-08-07 2024-09-03 齐鲁空天信息研究院 Lightweight target detection method and system based on unmanned aerial vehicle platform
CN118587622B * 2024-08-07 2024-11-01 齐鲁空天信息研究院 Lightweight target detection method and system based on unmanned aerial vehicle platform

Similar Documents

Publication Publication Date Title
CN110533084B (en) Multi-scale target detection method based on self-attention mechanism
CN110032998B (en) Method, system, device and storage medium for detecting characters of natural scene picture
CN105144239B (en) Image processing apparatus, image processing method
CN110598788B (en) Target detection method, target detection device, electronic equipment and storage medium
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN112926429B (en) Machine-check model training and video machine-check method, device, equipment and storage medium
CN102385592A (en) Image concept detection method and device
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
CN110008900A (en) A kind of visible remote sensing image candidate target extracting method by region to target
CN111368865B (en) Remote sensing image oil storage tank detection method and device, readable storage medium and equipment
CN110866931B (en) Image segmentation model training method and classification-based enhanced image segmentation method
CN116342955A (en) Target detection method and system based on improved feature pyramid network
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
CN111582057B (en) Face verification method based on local receptive field
CN113610178A (en) Inland ship target detection method and device based on video monitoring image
CN117911394A (en) Steel surface defect detection method and system based on improvement YOLOv5
CN117274754A (en) Gradient homogenization point cloud multi-task fusion method
US20240005635A1 (en) Object detection method and electronic apparatus
CN115984671A (en) Model online updating method and device, electronic equipment and readable storage medium
Rabecka et al. Assessing the performance of advanced object detection techniques for autonomous cars
CN114927236A (en) Detection method and system for multiple target images
CN117710755B (en) Vehicle attribute identification system and method based on deep learning
CN118229717B (en) Method, system and medium for segmenting quasi-circular contour image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination