
CN114694003A - Multi-scale feature fusion method based on target detection - Google Patents

Multi-scale feature fusion method based on target detection

Info

Publication number
CN114694003A
CN114694003A
Authority
CN
China
Prior art keywords
feature map
feature
size
dimension
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210303602.9A
Other languages
Chinese (zh)
Other versions
CN114694003B (en)
Inventor
王改华
甘鑫
曹清程
翟乾宇
刘洪�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202210303602.9A priority Critical patent/CN114694003B/en
Publication of CN114694003A publication Critical patent/CN114694003A/en
Application granted granted Critical
Publication of CN114694003B publication Critical patent/CN114694003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of target detection and relates to a multi-scale feature fusion method based on target detection, comprising the following steps: 1) convolving feature maps of different sizes to obtain feature maps with the same number of channels; 2) performing channel-dimension fusion processing and spatial-dimension fusion processing on the feature maps obtained in step 1), obtaining a channel-dimension fused feature map and a spatial-dimension fused feature map respectively; 3) fusing the channel-dimension fused feature map and the spatial-dimension fused feature map obtained in step 2), thereby achieving feature fusion of feature maps of different sizes in both the spatial and channel dimensions. The multi-scale feature fusion method based on target detection provided by the invention can significantly improve detection accuracy.

Description

A multi-scale feature fusion method based on target detection

Technical Field

The invention belongs to the field of target detection and relates to a feature fusion method, in particular to a multi-scale feature fusion method based on target detection.

Background Art

The main task of target detection is to locate the objects of interest in an input image and then accurately determine the category of each object. Target detection technology is now widely used in fields such as daily-life security, robot navigation, intelligent video surveillance, traffic scene detection, and aerospace.

With the development of deep learning, convolutional neural networks have gained increasingly wide recognition and application. Target detection algorithms based on deep learning use a convolutional neural network (CNN) to automatically extract features, and then feed the features into a detector to classify and localize objects.

In a single image, objects appear at different sizes. Training with feature maps of only one size gives unsatisfactory results, while cropping the feature maps into different sizes and feeding them through the network in multiple passes greatly increases network complexity and training time. To address the prediction of objects at different scales, Lin et al. proposed the well-known Feature Pyramid Network (FPN), shown in Figure 1; its basic idea is to combine the fine-grained spatial information of shallow feature maps with the semantic information of deep feature maps to detect multi-scale targets. On this basis, many improved FPN structures have been proposed. Liu et al. proposed PANet, which first fuses feature maps of different sizes through upsampling and then performs a further downsampling fusion, rebuilding a pyramid with strengthened spatial information. Golnaz Ghiasi et al. proposed NAS-FPN, which applies neural architecture search in a scalable search space covering all cross-scale connections and can fuse features across scales. Guo et al. proposed AugFPN, which addresses the shortcomings of FPN with Consistent Supervision, Residual Feature Augmentation, and Soft RoI Selection. Tan et al. proposed the BiFPN architecture, which fuses features with learned weights so that the network can learn the importance of different input features. Siyuan Qiao et al. proposed Recursive-FPN, which feeds the fused feature maps of a traditional FPN back into the backbone for a second pass. FPN can balance the classification accuracy and localization accuracy of targets and effectively handles the multi-scale prediction problem. However, this structure only fuses features in the spatial dimension, while the information between different channels may be correlated or redundant.

Summary of the Invention

To solve the above technical problems in the background art, the invention provides a multi-scale feature fusion method based on target detection that can significantly improve detection accuracy.

To achieve the above object, the invention adopts the following technical solution:

A multi-scale feature fusion method based on target detection, characterized in that it comprises the following steps:

1) Convolve feature maps of different sizes to obtain feature maps with the same number of channels;

2) Perform channel-dimension fusion processing and spatial-dimension fusion processing on the feature maps obtained in step 1), obtaining a channel-dimension fused feature map and a spatial-dimension fused feature map respectively;

3) Fuse the channel-dimension fused feature map and the spatial-dimension fused feature map obtained in step 2), achieving feature fusion of feature maps of different sizes in both the spatial and channel dimensions.

Preferably, step 1) is implemented as follows:

1.1) Obtain feature maps of different sizes, at least including feature map C3, feature map C4, and feature map C5, where the size of C3 is twice that of C4 and the size of C4 is twice that of C5;

1.2) Apply convolution to the feature maps of different sizes so that the number of channels after convolution is the same.

Preferably, the channel-dimension fusion processing in step 2) is implemented as follows: compress each feature map into a three-dimensional vector along the channel dimension, and then fuse the feature maps of different scales in the channel dimension.

Preferably, compressing a feature map into a three-dimensional vector along the channel dimension is implemented as follows:

Apply an unfold operation to the feature map, converting it from a four-dimensional tensor to a three-dimensional tensor; the four-dimensional tensor has dimensions [B, C, H, W], and the three-dimensional tensor has dimensions [B, C', L];

where:

C' satisfies the following relation:

C' = C * K * K

in which:

C is the channel size of the original feature map;

K is the size of the convolution kernel;

C' is the size of each sliding window;

L satisfies the following relation:

L = H' * W'

H' = floor((H + 2 * padding - K) / stride) + 1

W' = floor((W + 2 * padding - K) / stride) + 1

in which:

H is the height of the original feature map;

W is the width of the original feature map;

padding is the padding size;

K is the convolution kernel size;

stride is the stride size;

L is the number of sliding windows.
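To make the relations above concrete, here is a minimal sketch (not part of the patent; the helper name `unfold_output_size` is ours) that computes C' and L from the quantities just defined:

```python
import math

def unfold_output_size(C, H, W, K, padding=0, stride=1):
    """Compute (C', L) for an unfold over a [B, C, H, W] tensor,
    using C' = C*K*K and L = H'*W' as defined above."""
    H_out = math.floor((H + 2 * padding - K) / stride) + 1
    W_out = math.floor((W + 2 * padding - K) / stride) + 1
    return C * K * K, H_out * W_out

# A 256-channel 32x32 feature map, 3x3 kernel, padding 1, stride 1:
print(unfold_output_size(256, 32, 32, 3, padding=1, stride=1))  # (2304, 1024)
```

Note that with padding 1 and stride 1, a 3x3 kernel keeps L equal to H*W, while halving the spatial size of the input and doubling the stride leaves (C', L) unchanged.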

Preferably, fusing feature maps of different scales in the channel dimension is implemented as follows:

a) Feature map C3 undergoes one unfold operation to obtain C'3; feature map C4 undergoes two unfold operations to obtain C'4_1 and C'4_2; feature map C5 undergoes one unfold operation to obtain C'5; where C'3 and C'4_1 are equal in size, and C'4_2 and C'5 are equal in size;

b) Fuse C'3 with C'4_1 to obtain a three-dimensional tensor, and fuse C'4_2 with C'5 to obtain a three-dimensional tensor;

c) Restore the results of step b) to four four-dimensional tensors through reshape, where the first four-dimensional tensor has the same size as feature map C3, the second and third four-dimensional tensors each have the same size as feature map C4, and the fourth four-dimensional tensor has the same size as feature map C5;

d) The first four-dimensional tensor gives the channel-dimension fused feature map P3_1; the fourth four-dimensional tensor gives the channel-dimension fused feature map P5_1; adding the second and third four-dimensional tensors gives the channel-dimension fused feature map P4_1; the sizes of feature maps P3_1, P4_1, and P5_1 are all different.
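A minimal NumPy sketch of steps a) and b) (our own illustration, not the patent's code: `unfold` emulates an unfold with padding 0, and the kernel size and strides are chosen so that the two unfolded tensors line up and can be added):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def unfold(x, K, stride=1):
    """Emulate an unfold (padding 0) on a [B, C, H, W] array:
    returns [B, C*K*K, L] columns of flattened K x K patches."""
    B, C, H, W = x.shape
    win = sliding_window_view(x, (K, K), axis=(2, 3))[:, :, ::stride, ::stride]
    _, _, Hp, Wp, _, _ = win.shape
    return win.transpose(0, 1, 4, 5, 2, 3).reshape(B, C * K * K, Hp * Wp)

c3 = np.random.rand(1, 8, 16, 16)   # largest map
c4 = np.random.rand(1, 8, 8, 8)     # half the size of C3

# Different strides make the unfolded tensors equal in size:
u3 = unfold(c3, K=2, stride=4)      # [1, 32, 16]
u4_1 = unfold(c4, K=2, stride=2)    # [1, 32, 16]
fused = u3 + u4_1                   # channel-dimension fusion as in step b)
```

The fusion here is element-wise addition, which is one plausible reading of "fuse"; the patent does not pin down the exact operation.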

Preferably, the spatial-dimension fusion processing in step 2) upsamples the feature maps and then fuses the feature maps of different scales in the spatial dimension.

Preferably, the spatial-dimension fusion processing in step 2) is implemented as follows:

Upsample feature map C5 and fuse it with feature map C4 to obtain P4_2; upsample P4_2 and fuse it with feature map C3 to obtain P3_2; this finally gives the spatially fused feature maps P3_2, P4_2, and P5_2, whose sizes are all different.
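The top-down branch can be sketched as follows (our illustration, assuming nearest-neighbour 2x upsampling and element-wise addition as the fusion; the patent does not specify these choices):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a [B, C, H, W] array."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

c3 = np.random.rand(1, 256, 32, 32)
c4 = np.random.rand(1, 256, 16, 16)
c5 = np.random.rand(1, 256, 8, 8)

p5_2 = c5                        # C5 passes through directly
p4_2 = upsample2x(p5_2) + c4     # fuse upsampled C5 with C4
p3_2 = upsample2x(p4_2) + c3     # fuse upsampled P4_2 with C3
```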

Preferably, the fusion processing in step 3) is implemented as follows:

3.1) Integrate the feature maps of the same size from the channel dimension and the spatial dimension;

3.2) Apply ablation and downsampling operations to the integrated results of step 3.1), finally achieving feature fusion of feature maps of different sizes in both the spatial and channel dimensions.

The advantages of the invention are:

The invention provides a multi-scale feature fusion method based on target detection. The method first convolves feature maps of different sizes to obtain feature maps with the same number of channels. Next, it performs channel-dimension fusion processing and spatial-dimension fusion processing on the obtained feature maps, producing a channel-dimension fused feature map and a spatial-dimension fused feature map respectively. That is, two branches are used: the first branch applies a compression operation that compresses each feature map into a three-dimensional vector along the channel dimension and then fuses the feature maps of different scales in the channel dimension; the second branch applies an upsampling operation and then fuses the feature maps of different scales in the spatial dimension. Finally, the channel-dimension fused feature map and the spatial-dimension fused feature map are fused together, achieving feature fusion of feature maps of different sizes in both the spatial and channel dimensions. The method adds a branch to FPN that compresses all channel information together and fuses semantic information, finally obtaining rich semantic and spatial information. Experimental comparison shows that the network structure constructed by the invention achieves a certain improvement on different networks; in particular, the detection accuracy of medium and large targets is significantly improved.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the structure of the feature pyramid network FPN in the prior art;

Figure 2 is a schematic diagram of the structure of the MFPN provided by the invention;

Figure 3 is a schematic diagram of the overall structure of a target detection network built on the MFPN provided by the invention;

Figures 4 to 7 are detection result images for different detection targets obtained with different target detection networks.

Detailed Description of Embodiments

Referring to Figure 2, the invention provides a multi-scale feature fusion method based on target detection, comprising:

1) Convolve feature maps of different sizes to obtain feature maps with the same number of channels;

1.1) Obtain feature maps of different sizes, at least including feature map C3, feature map C4, and feature map C5, where the size of C3 is twice that of C4 and the size of C4 is twice that of C5;

1.2) Apply convolution to the feature maps of different sizes so that the number of channels after convolution is the same.

2) Perform channel-dimension fusion processing and spatial-dimension fusion processing on the feature maps obtained in step 1), obtaining a channel-dimension fused feature map and a spatial-dimension fused feature map respectively;

where:

The channel-dimension fusion processing is implemented as follows: compress each feature map into a three-dimensional vector along the channel dimension, and then fuse the feature maps of different scales in the channel dimension. Compressing a feature map into a three-dimensional vector along the channel dimension is implemented as follows:

Apply an unfold operation to the feature map, converting it from a four-dimensional tensor to a three-dimensional tensor; the four-dimensional tensor has dimensions [B, C, H, W], and the three-dimensional tensor has dimensions [B, C', L];

where:

C' satisfies the following relation:

C' = C * K * K

in which:

C is the channel size of the original feature map;

K is the size of the convolution kernel;

C' is the size of each sliding window;

L satisfies the following relation:

L = H' * W'

H' = floor((H + 2 * padding - K) / stride) + 1

W' = floor((W + 2 * padding - K) / stride) + 1

in which:

H is the height of the original feature map;

W is the width of the original feature map;

padding is the padding size;

K is the convolution kernel size;

stride is the stride size;

L is the number of sliding windows.

Fusing feature maps of different scales in the channel dimension is implemented as follows:

a) Feature map C3 undergoes one unfold operation to obtain C'3; feature map C4 undergoes two unfold operations to obtain C'4_1 and C'4_2; feature map C5 undergoes one unfold operation to obtain C'5; where C'3 and C'4_1 are equal in size, and C'4_2 and C'5 are equal in size;

b) Fuse C'3 with C'4_1 to obtain a three-dimensional tensor, and fuse C'4_2 with C'5 to obtain a three-dimensional tensor;

Suppose C5 is a four-dimensional tensor of size [b, c, h, w]. From steps 1.1) and 1.2), the size of C4 is [b, c, 2*h, 2*w] and the size of C3 is [b, c, 4*h, 4*w]. According to the formulas in step 1.2), after the unfold operations the size of C'3 is [b, c*k*k, h*w], the size of C'4_1 is [b, c*k*k, h*w], the size of C'4_2 is [b, c*k*k, 4*h*w], and the size of C'5 is [b, c*k*k, 4*h*w]. Therefore fusing C'3 and C'4_1 gives a three-dimensional tensor [b, c*k*k, h*w], and fusing C'4_2 and C'5 gives a three-dimensional tensor [b, c*k*k, 4*h*w];

c) Restore the results of step b) to four four-dimensional tensors through reshape, where the first four-dimensional tensor has the same size as feature map C3, the second and third four-dimensional tensors each have the same size as feature map C4, and the fourth four-dimensional tensor has the same size as feature map C5;

d) The first four-dimensional tensor gives the channel-dimension fused feature map P3_1; the fourth four-dimensional tensor gives the channel-dimension fused feature map P5_1; adding the second and third four-dimensional tensors gives the channel-dimension fused feature map P4_1; the sizes of feature maps P3_1, P4_1, and P5_1 are all different.

The spatial-dimension fusion processing upsamples the feature maps and then fuses the feature maps of different scales in the spatial dimension.

The spatial-dimension fusion processing is implemented as follows:

Upsample feature map C5 and fuse it with feature map C4 to obtain P4_2; upsample P4_2 and fuse it with feature map C3 to obtain P3_2; this finally gives the spatially fused feature maps P3_2, P4_2, and P5_2 (P5_2 being the direct output of C5), whose sizes are all different.

3) Fuse the channel-dimension fused feature map and the spatial-dimension fused feature map obtained in step 2), achieving feature fusion of feature maps of different sizes in both the spatial and channel dimensions, where the fusion processing is implemented as follows:

3.1) Integrate the feature maps of the same size from the channel dimension and the spatial dimension;

3.2) Apply ablation and downsampling operations to the integrated results of step 3.1), finally achieving feature fusion of feature maps of different sizes in both the spatial and channel dimensions.

Specifically, first obtain feature maps of different sizes [C3, C4, C5] (the size of C3 is twice that of C4, and the size of C4 is twice that of C5); after convolution the number of channels of each becomes 256. Then two parallel branches, branch1 and branch2, perform feature fusion in different dimensions, producing two sets of three feature maps of different sizes, [P3_1, P4_1, P5_1] and [P3_2, P4_2, P5_2]. Combining the same-size feature maps from the two dimensions gives new feature maps [P3, P4, P5] (the size of P3 is twice that of P4, and the size of P4 is twice that of P5). Finally, ablation and downsampling operations yield five feature maps of different sizes [P3, P4, P5, P6, P7].
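For a 512x512 input (the training size used in the experiments), the output strides [8, 16, 32, 64, 128] stated in the implementation imply the following pyramid resolutions; this is a quick sanity check of ours, not code from the patent:

```python
input_size = 512
strides = [8, 16, 32, 64, 128]            # strides of P3..P7
pyramid = {f"P{i + 3}": input_size // s for i, s in enumerate(strides)}
print(pyramid)  # {'P3': 64, 'P4': 32, 'P5': 16, 'P6': 8, 'P7': 4}
```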

The technical solution of the invention is further described below with reference to the drawings and embodiments.

The invention is an optimized neural network based on the MFPN structure, and its steps include:

Step 1: Data input and preprocessing.

The MS COCO 2017 dataset is used. The COCO dataset contains 80 categories for detection, covering individuals common in daily life such as "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", and "traffic light". It is a large, rich dataset for object detection, segmentation, and captioning. It contains four parts: annotations, test2017, train2017, and val2017, where train contains 118,287 images, val contains 5,000 images, and test contains 28,660 images. annotations is a collection of annotation types (object instances, object keypoints, and image captions) stored as JSON files.

All images are resized to 512x512 for multi-scale training, and data augmentation applies various operations to the image dataset: random flipping, pad filling for images that do not meet the requirements, random cropping for images that do not match the specified size, normalization, and image distortion.

Step 2: Model construction.

As shown in Figure 3, a target detection network is built based on VFNet; it consists of three parts: backbone, neck, and heads. Different target detection networks usually adopt the same backbone and neck, but their heads differ. The backbone is ResNet-50, used to extract image features; it outputs four feature maps of different sizes [C2, C3, C4, C5], with strides [4, 8, 16, 32] and channel sizes [256, 512, 1024, 2048]. The neck connects the backbone and the heads and fuses features. It takes the three backbone feature maps [C3, C4, C5]; after 1x1 convolution their channels are all reduced to 256 and fed into the MFPN feature fusion structure; C5 is then downsampled twice to obtain C6 and C7; finally a 3x3 convolution is applied to ablate the feature maps, outputting five feature maps of different sizes [P3, P4, P5, P6, P7], with strides [8, 16, 32, 64, 128] and 256 channels each. The heads perform object detection, implementing target classification and regression.
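The neck's channel unification and extra levels can be sketched in NumPy as follows (our illustration only: a 1x1 convolution is a per-pixel matrix multiply over channels, and the stride-2 slice stands in for the stride-2 convolution that produces C6 and C7):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: w has shape [C_out, C_in], x has shape [B, C_in, H, W]."""
    return np.einsum('oc,bchw->bohw', w, x)

def downsample2x(x):
    """Stride-2 spatial subsampling (stand-in for a stride-2 conv)."""
    return x[:, :, ::2, ::2]

c3 = np.random.rand(1, 512, 64, 64)    # backbone outputs for a 512x512 input
c4 = np.random.rand(1, 1024, 32, 32)
c5 = np.random.rand(1, 2048, 16, 16)

# 1x1 convolutions reduce every level to 256 channels
p3 = conv1x1(c3, np.random.rand(256, 512))
p4 = conv1x1(c4, np.random.rand(256, 1024))
p5 = conv1x1(c5, np.random.rand(256, 2048))
p6 = downsample2x(p5)                  # stride-64 level
p7 = downsample2x(p6)                  # stride-128 level
```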

Step 3: Training and testing.

The experiments use average precision (AP), AP50, AP75, AP_S, AP_M, and AP_L as the main evaluation metrics.

Experimental environment: a Python environment with PyTorch 1.6.0, torchvision 0.7.0, CUDA 10.0, and cuDNN 7.4 as the deep learning framework, implemented on the mmdetection 2.6 platform.

Experimental equipment:

CPU: Intel Xeon E5-2683 V3 @ 2.00 GHz;

RAM: 32 GB;

Graphics card: Nvidia GTX 1080Ti;

Hard disk: 500 GB.

The influence of the multi-scale fusion structure on detection accuracy was tested, and comparative experiments were carried out on multiple networks. The experimental results are shown in Table 1.

Table 1: The effect of MFPN on different networks


The MFPN structure brings improvements of varying degrees on all four networks. On ATSS, AP increases from 32.7% to 34%, and AP50 and AP_L even increase by 2%. On FCOS, AP improves by 0.9% from 29.1%, and the other metrics also improve by more than 1%. The AP of VFNet increases by 0.4% from 34.1%, while its AP_L increases from 50.5% to 52.8%. The improvement from the MFPN structure is most evident on FoveaBox, where AP improves by 2.4% and AP_L rises from 43.8% to 47.8%. This shows that the semantic information obtained by the MFPN module is richer than that obtained by the FPN module.

MFPN affects different networks differently: it increases AP by 0.4% on VFNet but by 2.4% on FoveaBox. This is believed to be related to the baseline network: the baseline AP of VFNet is as high as 34.1%, while that of FoveaBox is only 28.5%. MFPN brings limited improvement for small objects but significant improvements in detection accuracy for medium and large objects. In object detection, improving the detection accuracy of small objects requires larger feature maps. Although branch1 of the MFPN structure fuses feature maps in the channel dimension, the sliding window of the feature map is C*K*K, whose spatial extent is only K*K, so MFPN is not ideal for improving small-object detection.

The present invention tested the detection results of MFPN and the other networks; the results are shown in Figures 4 to 7, where in each figure (a) shows the detection result of FCOS, (b) that of ATSS, (c) that of Foveabox, and (d) that of the MFPN network provided by the present invention. In the vehicle detection of Figure 4, the MFPN network detects more objects in dense scenes. In the single-dog detection of Figure 5, all networks detect the target accurately, but the MFPN network achieves the highest classification accuracy, reaching 96%. In the multi-object detection of Figure 6, the MFPN network not only detects the person and the bicycle well but also detects partially occluded objects. In Figure 7, the MFPN network best separates the detections of the person and the horse, and its detection accuracy is also the highest.

Claims (8)

1. A multi-scale feature fusion method based on target detection, characterized in that the method comprises the following steps:
1) convolving feature maps of different sizes to obtain feature maps with the same number of channels;
2) performing channel-dimension fusion and spatial-dimension fusion separately on the feature maps obtained in step 1), yielding a channel-dimension-fused feature map and a spatial-dimension-fused feature map, respectively;
3) fusing the channel-dimension-fused feature map and the spatial-dimension-fused feature map obtained in step 2), thereby fusing feature maps of different sizes in both the spatial and channel dimensions.
2. The multi-scale feature fusion method based on target detection according to claim 1, characterized in that step 1) is implemented as follows:
1.1) obtaining feature maps of different sizes, comprising at least a feature map C3, a feature map C4 and a feature map C5, wherein the size of feature map C3 is twice that of feature map C4, and the size of feature map C4 is twice that of feature map C5;
1.2) convolving the feature maps of different sizes so that the number of channels after convolution is the same.
3. The multi-scale feature fusion method based on target detection according to claim 2, characterized in that the channel-dimension fusion in step 2) is implemented by compressing each feature map into a three-dimensional vector along the channel dimension, and then fusing the feature maps of different scales in the channel dimension.
4. The multi-scale feature fusion method based on target detection according to claim 3, characterized in that compressing a feature map into a three-dimensional vector along the channel dimension is implemented as follows:
the feature map undergoes an unfold operation, converting it from a four-dimensional tensor into a three-dimensional tensor; the dimensions of the four-dimensional tensor are [B, C, H, W], and the dimensions of the three-dimensional tensor are [B, C', L];
where C' satisfies the following relation:
C' = C*K*K
in which C is the channel size of the original feature map, K is the size of the convolution kernel, and C' is the sliding-window size;
and L satisfies the following relation:
L = H'*W'
H' = floor((H + 2*padding - K) / stride) + 1
W' = floor((W + 2*padding - K) / stride) + 1
where H is the height of the original feature map, W is the width of the original feature map, padding is the padding size, K is the convolution kernel size, stride is the stride, and L is the number of sliding windows.
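As a sanity check on the shape relations in claim 4, the output dimensions of the unfold operation can be computed directly. Below is a minimal Python sketch; the helper name `unfold_output_shape` and the example parameter values (80x80 map, K=3, etc.) are ours for illustration, not part of the patent.

```python
import math

def unfold_output_shape(C, H, W, K, padding=0, stride=1):
    """Shape [B, C', L] produced by unfolding a [B, C, H, W] feature map
    with a K x K sliding window, per the formulas in claim 4."""
    C_prime = C * K * K  # sliding-window size: K*K patch values per channel
    H_prime = math.floor((H + 2 * padding - K) / stride) + 1
    W_prime = math.floor((W + 2 * padding - K) / stride) + 1
    L = H_prime * W_prime  # number of sliding-window positions
    return C_prime, L

# Example: a 256-channel 80x80 map, 3x3 kernel, padding 1, stride 2
print(unfold_output_shape(256, 80, 80, K=3, padding=1, stride=2))  # (2304, 1600)
```

Note that unfolding an 80x80 map with stride 2 and a 40x40 map with stride 1 (both K=3, padding 1) yields identical [B, 2304, 1600] shapes, consistent with claim 5's requirement that C'3 and C'4_1 be equal in size; the specific stride/padding choice is an assumption, as the claims do not fix these parameters.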
5. The multi-scale feature fusion method based on target detection according to claim 4, characterized in that fusing the feature maps of different scales in the channel dimension is implemented as follows:
a) feature map C3 undergoes one unfold operation to obtain C'3; feature map C4 undergoes two unfold operations to obtain C'4_1 and C'4_2; feature map C5 undergoes one unfold operation to obtain C'5; wherein C'3 and C'4_1 are equal in size, and C'4_2 and C'5 are equal in size;
b) C'3 and C'4_1 are fused to obtain a three-dimensional tensor, and C'4_2 and C'5 are fused to obtain a three-dimensional tensor;
c) the results of step b) are restored by reshape into four four-dimensional tensors, wherein the size of the first four-dimensional tensor is the same as that of feature map C3, the sizes of the second and third four-dimensional tensors are each the same as that of feature map C4, and the size of the fourth four-dimensional tensor is the same as that of feature map C5;
d) the first four-dimensional tensor yields the channel-dimension-fused feature map P3_1; the fourth four-dimensional tensor yields the channel-dimension-fused feature map P5_1; the second and third four-dimensional tensors are added to yield the channel-dimension-fused feature map P4_1; the sizes of feature map P3_1, feature map P4_1 and feature map P5_1 are all different.
6. The multi-scale feature fusion method based on target detection according to claim 1, characterized in that the spatial-dimension fusion in step 2) upsamples the feature maps and then fuses the feature maps of different scales in the spatial dimension.
7. The multi-scale feature fusion method based on target detection according to claim 6, characterized in that the spatial-dimension fusion in step 2) is implemented as follows:
feature map C5 is upsampled and fused with feature map C4 to obtain P4_2; P4_2 is then upsampled and fused with feature map C3 to obtain P3_2; finally the spatially fused feature map P3_2, feature map P4_2 and feature map P5_2 are obtained; the sizes of feature map P3_2, feature map P4_2 and feature map P5_2 are all different.
8. The multi-scale feature fusion method based on target detection according to any one of claims 1-7, characterized in that the fusion processing in step 3) is implemented as follows:
3.1) integrating the feature maps of the same size from the channel dimension and the spatial dimension;
3.2) performing ablation and downsampling operations on the integrated results of step 3.1), finally achieving feature fusion of the feature maps of different sizes in both the spatial and channel dimensions.
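The top-down spatial-dimension fusion path of claims 6-7 can be sketched with toy single-channel maps. In this sketch, nearest-neighbour upsampling and element-wise addition stand in for the upsample and fusion operators, which the claims do not fix; the map sizes (8x8, 4x4, 2x2) and all values are illustrative assumptions.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an [H][W] map (batch and
    channel dimensions omitted for brevity)."""
    return [[v for v in row for _ in (0, 1)] for row in fmap for _ in (0, 1)]

def add_maps(a, b):
    """Element-wise addition as a stand-in fusion operator."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy single-channel maps: each level is half the size of the previous,
# as required by claim 2 (C3 = 2 x C4, C4 = 2 x C5).
C3 = [[1.0] * 8 for _ in range(8)]
C4 = [[2.0] * 4 for _ in range(4)]
C5 = [[3.0] * 2 for _ in range(2)]

P5_2 = C5                              # top level passes through
P4_2 = add_maps(upsample2x(C5), C4)    # upsample C5, fuse with C4 (claim 7)
P3_2 = add_maps(upsample2x(P4_2), C3)  # upsample P4_2, fuse with C3 (claim 7)
print(len(P3_2), len(P4_2), len(P5_2))  # 8 4 2 -- the three sizes stay distinct
```

The three output sizes remain different (8, 4, 2), matching claim 7's statement that P3_2, P4_2 and P5_2 are of non-equal sizes.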
CN202210303602.9A 2022-03-24 2022-03-24 Multi-scale feature fusion method based on target detection Active CN114694003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210303602.9A CN114694003B (en) 2022-03-24 2022-03-24 Multi-scale feature fusion method based on target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210303602.9A CN114694003B (en) 2022-03-24 2022-03-24 Multi-scale feature fusion method based on target detection

Publications (2)

Publication Number Publication Date
CN114694003A true CN114694003A (en) 2022-07-01
CN114694003B CN114694003B (en) 2024-08-13

Family

ID=82139141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210303602.9A Active CN114694003B (en) 2022-03-24 2022-03-24 Multi-scale feature fusion method based on target detection

Country Status (1)

Country Link
CN (1) CN114694003B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032840A1 (en) * 2016-07-27 2018-02-01 Beijing Kuangshi Technology Co., Ltd. Method and apparatus for neural network training and construction and method and apparatus for object detection
WO2020047738A1 (en) * 2018-09-04 2020-03-12 安徽中科智能感知大数据产业技术研究院有限责任公司 Automatic pest counting method based on combination of multi-scale feature fusion network and positioning model
WO2020102988A1 (en) * 2018-11-20 2020-05-28 西安电子科技大学 Feature fusion and dense connection based infrared plane target detection method
CN113111736A (en) * 2021-03-26 2021-07-13 浙江理工大学 Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN
CN114118284A (en) * 2021-11-30 2022-03-01 重庆理工大学 A target detection method based on multi-scale feature fusion


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU FANG; WU ZHIWEI; YANG ANZHE; HAN XIAO: "Adaptive UAV Target Detection Based on Multi-Scale Feature Fusion", Acta Optica Sinica (光学学报), no. 10, 25 May 2020 (2020-05-25) *

Also Published As

Publication number Publication date
CN114694003B (en) 2024-08-13

Similar Documents

Publication Publication Date Title
CN113392960B (en) Target detection network and method based on mixed hole convolution pyramid
CN111461083A (en) A fast vehicle detection method based on deep learning
CN114495029B (en) A traffic target detection method and system based on improved YOLOv4
CN110298843B (en) Two-dimensional image component segmentation method based on improved deep Lab and application thereof
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN113436210B (en) A Road Image Segmentation Method Based on Context Sampling
CN110197152A (en) A kind of road target recognition methods for automated driving system
CN113449743A (en) Coal dust particle feature extraction method
CN112819000A (en) Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN112446242A (en) Acoustic scene classification method and device and corresponding equipment
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
CN116824554A (en) Multi-source information fusion driving safety assessment method based on subjective assessment
CN111079543B (en) Efficient vehicle color identification method based on deep learning
CN116824543A (en) Automatic driving target detection method based on OD-YOLO
CN112446292A (en) 2D image salient target detection method and system
CN117710348B (en) Pavement crack detection method and system based on position information and attention mechanism
CN114694003B (en) Multi-scale feature fusion method based on target detection
CN117765498A (en) SNC-YOLOv 5-based automatic driving target detection algorithm
CN113420660B (en) Infrared image target detection model construction method, prediction method and system
CN117372905A (en) Vehicle pedestrian target detection method based on improved YOLOv5 network
CN110502995A (en) A driver yawn detection method based on subtle facial action recognition
CN112131925B (en) Construction method of multichannel feature space pyramid
CN114463844A (en) A fall detection method based on self-attention dual-stream network
CN115171059A (en) Vehicle perception method based on improved YOLOv5 network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant