WO2023273337A1

WO2023273337A1 - Representative feature-based method for detecting dense targets in remote sensing image

Info

Publication number: WO2023273337A1
Application number: PCT/CN2022/074542
Authority: WO
Inventors: 胡凡; 方效林; 吴文甲; 杨明; 罗军舟
Original assignee: 南京逸智网络空间技术创新研究院有限公司
Priority date: 2021-06-29
Filing date: 2022-01-28
Publication date: 2023-01-05
Also published as: CN113536986A; CN113536986B

Abstract

A representative feature-based method for detecting dense targets in a remote sensing image, comprising: constructing a feature extraction network, a feature pyramid network, a preliminary prediction network, and a final prediction network, and sequentially inputting a remote sensing image to be detected into the feature extraction network and the feature pyramid network to output a preliminary feature map; inputting the preliminary feature map into the preliminary prediction network, and selecting a representative feature of semantic information of each category from all categories in a data set and representative confidence of each category in the whole feature map; inputting a feature map outputted by the preliminary prediction network into the final prediction network to obtain a final feature map, and calculating a similarity between the representative feature of a same category and a feature vector at the same position of the final feature map; and by taking the similarity as a weight, adaptively improving classification confidence on the basis of classification confidence of a hard positive sample.

Description

A Dense Object Detection Method in Remote Sensing Images Based on Representative Features

Technical field

The invention relates to target detection, in particular to a dense target detection method in remote sensing images based on representative features.

Background art

Remote sensing is a rapidly developing technology, and the information network it forms provides a large amount of scientific data and dynamic information. Object detection in remote sensing images is a benchmark problem in target detection, with considerable application value in fields such as agriculture, meteorological surveying and mapping, and environmental protection.

With the great success of deep learning algorithms in computer vision, deep learning has come to be regarded as the preferred approach to remote sensing image processing. Because of the bird's-eye view and the larger spatial field of view, remote sensing images contain more dense scenes with large numbers of densely arranged objects. In deep-learning-based target detection, a sample whose target category corresponds to the ground-truth label is a positive sample, and a positive sample with a large error between its predicted classification confidence and the ground-truth label is a difficult sample. Existing high-performing detection models can detect most objects in an image, but often miss some difficult positive samples. When the classification confidence that the detection model predicts for a difficult positive sample is lower than the set confidence threshold, the sample is filtered out in the post-processing stage, which degrades detection performance; conversely, artificially lowering the confidence threshold in the post-processing stage makes the detection model lose its ability to suppress low-confidence negative samples. Accurately detecting multiple densely arranged objects in remote sensing images is therefore challenging.

Summary of the invention

Purpose of the invention: In view of the above problems, the purpose of the present invention is to provide a dense target detection method in remote sensing images based on representative features which, by adaptively increasing the classification confidence of difficult positive samples, accurately detects multiple densely arranged similar objects in remote sensing images.

Technical solution: The dense target detection method in remote sensing images based on representative features of the present invention comprises the following steps:

(1) Construct four network modules, including a feature extraction network, a feature pyramid network, a preliminary prediction network and a final prediction network, input the remote sensing image to be detected into the feature extraction network and the feature pyramid network in turn, and output a preliminary feature map;

(2) Input the preliminary feature map into the preliminary prediction network and select, for each category in the data set, a representative feature of that category's semantic information and the category's representative confidence over the entire feature map;

(3) Input the feature map output by the preliminary prediction network into the final prediction network to obtain the final feature map, and calculate the similarity between the representative feature of the same category and the feature vector at the same position of the final feature map;

(4) Take the similarity obtained in step 3 as the weight and, starting from the classification confidence of a difficult positive sample, adaptively increase that confidence to obtain the final classification confidence of the difficult positive sample.

Further, the process of obtaining the highest classification confidence and representative features in step 2 is:

(201) In the classification branch of the preliminary prediction network, calculate the classification confidence of each category k at each of the H×W positions of the entire feature map, where H is the length of the feature map, W is the width, and k is the category of the data set;

(202) Among these confidences, find the highest classification confidence as the representative confidence RepConfidence_k of category k, and find the position (h, w) at which it is obtained, where h is the row index along the length and w is the column index along the width;

(203) Extract the feature vector at the h-th row and w-th column of the preliminary feature map FM_FAM and use it as the representative feature RepFeature_k of category k, where FM_FAM is the feature map of the previous layer shared by the classification branch and the regression branch of the preliminary prediction network;

(204) Set a classification confidence threshold, and only when the representative confidence of category k is greater than the classification confidence threshold, the representative feature of category k is an effective representative feature.
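Steps (201) to (204) can be illustrated with a minimal sketch. It assumes the classification confidences and the feature map FM_FAM are already available as nested Python lists; the function name `select_representatives` and the data layout are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def select_representatives(
    conf_maps: List[List[List[float]]],  # conf_maps[k][h][w]: confidence of category k at (h, w)
    feature_map: List[List[Vector]],     # feature_map[h][w]: feature vector taken from FM_FAM
    threshold: float,
) -> Dict[int, Tuple[float, Tuple[int, int], Vector]]:
    """Return {k: (RepConfidence_k, (h, w), RepFeature_k)} for effective categories only."""
    representatives = {}
    for k, conf in enumerate(conf_maps):
        # Step (202): highest confidence of category k over all H x W positions.
        # On exact ties, max() keeps the candidate with the largest (h, w).
        best_conf, best_pos = max(
            (value, (h, w))
            for h, row in enumerate(conf)
            for w, value in enumerate(row)
        )
        # Step (204): only confidences above the threshold give effective representatives.
        if best_conf > threshold:
            h, w = best_pos
            # Step (203): the feature vector at (h, w) becomes RepFeature_k.
            representatives[k] = (best_conf, best_pos, feature_map[h][w])
    return representatives


# Toy data: 2 categories on a 2 x 2 feature map with 2 channels.
conf_maps = [
    [[0.1, 0.9], [0.2, 0.3]],  # category 0 peaks at (0, 1)
    [[0.4, 0.2], [0.3, 0.1]],  # category 1 never exceeds the threshold
]
feature_map = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.2, 0.8]],
]
reps = select_representatives(conf_maps, feature_map, threshold=0.6)
```

In this toy input only category 0 yields an effective representative feature; category 1 is dropped because its highest confidence (0.4) does not exceed the threshold.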

Further, the similarity in step 3 includes feature semantic similarity and feature space similarity, and the feature semantic similarity calculation process includes:

Using the embedded Gaussian similarity measurement function to calculate the similarity of feature semantic information, and normalizing the resulting measure. The embedded Gaussian similarity measurement function is:

Where RF_k represents the representative feature RepFeature_k of the k-th category, and F_hw represents the feature vector at the h-th row and w-th column of the feature map FM_ODM output by the final prediction network. The feature vectors RepFeature_k and F_hw are both 1×1×n-dimensional, and i indexes the feature value of the i-th of the n dimensions;

A linear embedding space is used:

φ(RF_k) = W_φ RF_k

θ(F_hw) = W_θ F_hw

Where W_φ and W_θ are learned weight matrices, and φ(RF_k)^i and θ(F_hw)^i respectively represent the values of the two embedded feature vectors in the i-th dimension;

N(φ(RF)) is the normalization factor. It is calculated as the sum of the similarities between the feature vector F_hw at the h-th row and w-th column in the final prediction network and the K effective representative features RF_k, where K is the number of categories in the data set. It normalizes the embedded Gaussian similarity to the range 0 to 1, which avoids the gradient explosion problem caused by excessively large similarity values. The formula for calculating the normalization factor is as follows:
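The formulas for the embedded Gaussian similarity and its normalization factor are not reproduced in this text. The sketch below illustrates the description under an assumed form: the unnormalized similarity is exp(Σ_i φ(RF_k)^i · θ(F_hw)^i), and N(φ(RF)) is the sum of that quantity over the effective representative features. The function names and the nested-list representation of W_φ and W_θ are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Linear embedding W x, as in phi(RF_k) = W_phi RF_k and theta(F_hw) = W_theta F_hw."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]


def embedded_gaussian_similarities(
    rep_features: List[Vector],  # effective representative features RF_k
    f_hw: Vector,                # feature vector at (h, w) of FM_ODM
    w_phi: Matrix,
    w_theta: Matrix,
) -> List[float]:
    """Normalized similarity between F_hw and every RF_k (assumed form, values sum to 1)."""
    if not rep_features:
        return []
    theta = matvec(w_theta, f_hw)
    # Dot product over the n embedded dimensions (index i in the text).
    logits = [
        sum(p * t for p, t in zip(matvec(w_phi, rf), theta))
        for rf in rep_features
    ]
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    norm = sum(exps)  # plays the role of the normalization factor N(phi(RF))
    return [e / norm for e in exps]


identity = [[1.0, 0.0], [0.0, 1.0]]
sims = embedded_gaussian_similarities(
    rep_features=[[1.0, 0.0], [0.0, 1.0]],
    f_hw=[2.0, 0.0],
    w_phi=identity,
    w_theta=identity,
)
```

Under this assumption every normalized similarity lies between 0 and 1, which matches the stated purpose of the normalization factor.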

Further, the feature space similarity calculation process includes the following steps:

(301) Calculate the spatial distance dis(RF_k, F_hw) between the feature vectors RepFeature_k and F_hw in the feature-map dimension:

where the two coordinate pairs in the formula are, respectively, the abscissa and ordinate of the feature vector RepFeature_k in the feature map and the abscissa and ordinate of the feature vector F_hw in the feature map;

(302) Multiply dis(RF_k, F_hw) by the stride_i of each feature map to obtain the spatial correlation Corr_{Spatial_i}(RF_k, F_hw) of the two feature vectors, based on their distance on the original image; the calculation formula is:

Where Spatial_i means that RF_k and F_hw are taken from the i-th feature-map level, counted from the bottom up, of the feature pyramid network, and α is the scale parameter;

Therefore, the similarity expression in step 3 is:

Similarity(RF_k, F_hw) = Sim_{Embedded_Gaussian}(RF_k, F_hw) + Corr_{Spatial_i}(RF_k, F_hw)

Further, the final classification confidence of the difficult positive sample in step 4 is obtained by adding the representative confidence of category k, weighted by the similarity, to the confidence for category k at position (h, w) of the final prediction network's feature map; the calculation formula is:

Further, the method for measuring the feature semantic similarity includes using any one of Euclidean similarity, cosine similarity or Gaussian similarity.
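The three alternative measures can be sketched as follows. The text names them without giving definitions, so the forms used here are assumptions: the Euclidean similarity is taken as 1 / (1 + distance), the Gaussian similarity as exp(-distance² / (2σ²)) with a hypothetical bandwidth `sigma`, and the cosine similarity in its usual form.

```python
import math
from typing import Sequence


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed form: maps Euclidean distance d to 1 / (1 + d), so identical vectors give 1."""
    return 1.0 / (1.0 + math.dist(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; defined here as 0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gaussian_similarity(a: Sequence[float], b: Sequence[float], sigma: float = 1.0) -> float:
    """Assumed form: exp(-d^2 / (2 sigma^2)) with d the Euclidean distance."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))


rf_k = [1.0, 0.0]
f_same = [1.0, 0.0]
f_orthogonal = [0.0, 1.0]
```

All three return 1 for identical vectors and a smaller value for `f_orthogonal`, so any of them could serve as the semantic term of the similarity.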

Further, the feature extraction network uses a convolutional layer to reduce the size of the original image, and the extracted effective features are input to the feature pyramid network; the feature extraction network selects a ResNet or HRNet convolutional neural network.

Further, the preliminary prediction network selects the feature alignment module in the S²A-NET model to preliminarily predict the category information and location information of the object.

Further, the final prediction network selects the rotation detection module in the S²A-NET model to predict the final category information and position information of the object.

Beneficial effects: compared with the prior art, the present invention has the following notable advantages:

1. The present invention uses representative features and representative confidence to adaptively improve the classification confidence of difficult positive samples, and improves the classification ability for difficult positive samples in dense remote sensing image scenes;

2. The present invention uses the parameters of the two-stage classification branches, which ensures the consistency of the similarity calculation process and reduces the complexity of the detection model and the number of network parameters.

Description of drawings

Fig. 1 is a schematic diagram of the representative feature acquisition process of the present invention;

Fig. 2 is a flow chart of the similarity calculation of the present invention;

Fig. 3 is a schematic diagram of improving classification confidence for difficult positive samples in the present invention.

Detailed description

The dense target detection method in a remote sensing image based on representative features described in this embodiment includes the following steps:

(1) Construct four network modules: a feature extraction network, a feature pyramid network, a preliminary prediction network and a final prediction network. Input the remote sensing image to be detected into the feature extraction network, which uses convolutional layers to reduce the size of the original image. The feature extraction network inputs the extracted effective features into the feature pyramid network, which then outputs the preliminary feature map FM_FAM.

The feature extraction network is a ResNet or HRNet convolutional neural network; the preliminary prediction network is the feature alignment module FAM of the S²A-NET model; the final prediction network is the rotation detection module ODM of the S²A-NET model.

(2) Input the preliminary feature map FM_FAM into the feature alignment module FAM and select, for each category in the data set, a representative feature of that category's semantic information and the category's representative confidence over the entire feature map. As shown in Figure 1, the process is:

(201) In the classification branch of the preliminary prediction network, calculate the classification confidence of each category k at each of the H×W positions of the entire feature map, where H is the length of the feature map, W is the width, and k is the category of the data set. The data set of this embodiment contains 15 object categories, and the 15 categories are calculated sequentially;

(202) Among these confidences, find the highest classification confidence as the representative confidence RepConfidence_k of category k, and find the position (h, w) at which it is obtained;

(203) Extract the feature vector at the h-th row and w-th column of the preliminary feature map FM_FAM and use it as the representative feature RepFeature_k of category k, where FM_FAM is the feature map of the previous layer shared by the classification branch and the regression branch of the preliminary prediction network. FM_FAM contains features related to both object category and location information, and is used in the subsequent calculation of similarity between features. The feature map FM_FAM is H×W×C-dimensional, where H and W are the length and width of the feature map and C is the number of channels. In this embodiment, C is 256;

(204) Set the classification confidence threshold so as to balance the reliability of a representative feature against the difficulty of qualifying as one. In this embodiment, the threshold is set to 0.6: only when the representative confidence of category k is greater than 0.6 is the representative feature of category k an effective representative feature. When RepConfidence_k is low, such as 0.3 or 0.4, the probability that RepFeature_k itself belongs to category k is also low, so it cannot serve as an effective representative feature.

(3) Input the feature map output by the preliminary prediction network into the final prediction network to obtain the final feature map, and calculate the similarity between the representative feature of the same category and the feature vector at the same position in the final feature map. The flow chart is shown in Figure 2. The similarity includes feature semantic similarity and feature space similarity.

The feature semantic similarity can be measured with any one of Euclidean similarity, cosine similarity or Gaussian similarity. In this embodiment, Gaussian similarity is used to calculate the feature semantic similarity. The process is as follows:

The embedded Gaussian similarity measurement function is used to calculate the similarity of feature semantic information, and the resulting measure is normalized. The embedded Gaussian similarity measurement function is:

The feature vectors RepFeature_k and F_hw are both 1×1×n-dimensional. In this embodiment the dimension n of the feature vectors is 256, and i indexes the feature value of the i-th of the n dimensions;

A linear embedding space is used:

φ(RF_k) = W_φ RF_k

θ(F_hw) = W_θ F_hw

N(φ(RF)) is the normalization factor. It is calculated as the sum of the similarities between the feature vector F_hw at the h-th row and w-th column in the final prediction network and the 15 effective representative features RF_k. It normalizes the embedded Gaussian similarity to the range 0 to 1, which avoids the gradient explosion problem caused by excessively large similarity values. The formula for calculating the normalization factor is as follows:

The feature space similarity calculation process includes the following steps:

(301) Calculate the spatial distance dis(RF_k, F_hw) between the feature vectors RepFeature_k and F_hw in the feature-map dimension, using the abscissa and ordinate of RepFeature_k and the abscissa and ordinate of F_hw in the feature map. When training the model, the five feature-map levels of the feature pyramid network, counted from the bottom up, are used for prediction, and the stride_i values of the five levels are 8, 16, 32, 64 and 128 respectively;

(302) Multiply dis(RF_k, F_hw) by the stride_i of each feature map to obtain the spatial correlation Corr_{Spatial_i}(RF_k, F_hw) of the two feature vectors, based on their distance on the original image. Here Spatial_i means that RF_k and F_hw are taken from the i-th feature-map level, counted from the bottom up, of the feature pyramid network, and α is a scale parameter. In this embodiment, α is set to 1/64 so that two features that are closer together have a higher spatial location correlation.
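The spatial term can be illustrated with the embodiment's values (strides 8, 16, 32, 64, 128 and α = 1/64). The distance and correlation formulas are not reproduced in this text, so two assumptions are made here and marked in the code: dis(RF_k, F_hw) is taken as the Euclidean distance between the two positions on the feature map, and the correlation is taken as an exponential decay exp(-α · distance on the original image). The function names are hypothetical.

```python
import math
from typing import Tuple

STRIDES = (8, 16, 32, 64, 128)  # stride_i of the five pyramid levels in the embodiment
ALPHA = 1 / 64                  # scale parameter alpha in the embodiment

Position = Tuple[int, int]


def image_space_distance(rep_pos: Position, pos: Position, level: int) -> float:
    """dis(RF_k, F_hw) on pyramid level `level`, scaled to original-image pixels."""
    dis = math.dist(rep_pos, pos)  # ASSUMPTION: Euclidean distance on the feature map
    return dis * STRIDES[level]    # step (302): multiply by the level's stride


def spatial_correlation(rep_pos: Position, pos: Position, level: int,
                        alpha: float = ALPHA) -> float:
    """ASSUMPTION: exponential decay, so closer features get a higher correlation."""
    return math.exp(-alpha * image_space_distance(rep_pos, pos, level))


# Same offset on the feature map, measured on the finest level and on the third level.
near = spatial_correlation((10, 10), (13, 14), level=0)  # 5 cells * 8 px  = 40 px
far = spatial_correlation((10, 10), (13, 14), level=2)   # 5 cells * 32 px = 160 px
```

Multiplying by the stride makes offsets comparable across pyramid levels: the same 5-cell offset corresponds to 40 pixels on the finest level but 160 pixels on the third level, so `near` is larger than `far`.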

The similarity expression is:

Similarity(RF_k, F_hw) = Sim_{Embedded_Gaussian}(RF_k, F_hw) + Corr_{Spatial_i}(RF_k, F_hw)

(4) Take the similarity obtained in step 3 as the weight, and adaptively increase the classification confidence based on the classification confidence of difficult positive samples, as the final classification confidence of difficult positive samples, as shown in Figure 3.

The final classification confidence of the difficult positive sample is obtained by adding the representative confidence of category k, weighted by the similarity, to the confidence for category k at position (h, w) of the final prediction network's feature map; the calculation formula is:
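The confidence formula is not reproduced in this text. A minimal sketch of step 4, assuming the additive form suggested by the description (original confidence plus similarity times representative confidence), might look as follows; the function names are hypothetical, and the text does not say whether the result is clipped, so no clipping is applied.

```python
def combined_similarity(semantic_similarity: float, spatial_correlation: float) -> float:
    """Similarity(RF_k, F_hw) = Sim_Embedded_Gaussian + Corr_Spatial_i, as stated in the text."""
    return semantic_similarity + spatial_correlation


def boosted_confidence(conf_hw_k: float, rep_confidence_k: float, similarity: float) -> float:
    """ASSUMED form: confidence at (h, w) plus the similarity-weighted RepConfidence_k."""
    return conf_hw_k + similarity * rep_confidence_k


# A difficult positive sample with a low initial confidence for category k.
similarity = combined_similarity(semantic_similarity=0.5, spatial_correlation=0.25)
final_conf = boosted_confidence(conf_hw_k=0.25, rep_confidence_k=0.8, similarity=similarity)
```

With these illustrative numbers the confidence rises from 0.25 to roughly 0.85. A position with low similarity to the representative feature receives a correspondingly small boost.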

Claims

A dense target detection method in remote sensing images based on representative features, characterized in that it comprises the following steps:

(1) Construct four network modules, including a feature extraction network, a feature pyramid network, a preliminary prediction network and a final prediction network, input the remote sensing image to be detected into the feature extraction network and the feature pyramid network in turn, and output a preliminary feature map;

(2) Input the preliminary feature map into the preliminary prediction network and select, for each category in the data set, a representative feature of that category's semantic information and the category's representative confidence over the entire feature map;

(3) Input the feature map output by the preliminary prediction network into the final prediction network to obtain the final feature map, and calculate the similarity between the representative feature of the same category and the feature vector at the same position of the final feature map;

(4) Take the similarity obtained in step 3 as the weight, and adaptively increase the classification confidence based on the classification confidence of difficult positive samples, as the final classification confidence of difficult positive samples.
The dense target detection method according to claim 1, wherein the process of obtaining the highest classification confidence and representative features in step 2 is:

(201) In the classification branch of the preliminary prediction network, calculate the classification confidence of each category k at each of the H×W positions of the entire feature map, where H is the length of the feature map, W is the width, and k is the category of the data set;

(202) Among these confidences, find the highest classification confidence as the representative confidence RepConfidence_k of category k, and find the position (h, w) at which it is obtained, where h is the row index along the length and w is the column index along the width;

(203) Extract the feature vector at the h-th row and w-th column of the preliminary feature map FM_FAM and use it as the representative feature RepFeature_k of category k, where FM_FAM is the feature map of the previous layer shared by the classification branch and the regression branch of the preliminary prediction network;

(204) Set a classification confidence threshold; only when the representative confidence of category k is greater than the classification confidence threshold is the representative feature of category k an effective representative feature.
The dense target detection method according to claim 2, wherein the similarity in step 3 includes a feature semantic similarity and a feature space similarity, and the feature semantic similarity calculation process includes:

Using the embedded Gaussian similarity measurement function to calculate the similarity of feature semantic information, and normalizing the resulting measure. The embedded Gaussian similarity measurement function is:

Where RF_k represents the representative feature RepFeature_k of the k-th category, and F_hw represents the feature vector at the h-th row and w-th column of the feature map FM_ODM output by the final prediction network. The feature vectors RepFeature_k and F_hw are both 1×1×n-dimensional, and i indexes the feature value of the i-th of the n dimensions;

A linear embedding space is used:

φ(RF_k) = W_φ RF_k

θ(F_hw) = W_θ F_hw

Where W_φ and W_θ are learned weight matrices, and φ(RF_k)^i and θ(F_hw)^i respectively represent the values of the two embedded feature vectors in the i-th dimension;

N(φ(RF)) is the normalization factor. It is calculated as the sum of the similarities between the feature vector F_hw at the h-th row and w-th column in the final prediction network and the K effective representative features RF_k. It normalizes the embedded Gaussian similarity to the range 0 to 1, which avoids the gradient explosion problem caused by excessively large similarity values. The formula for calculating the normalization factor is as follows:
The dense target detection method according to claim 3, wherein the feature space similarity calculation process comprises the following steps:

(301) Calculate the spatial distance dis(RF_k, F_hw) between the feature vectors RepFeature_k and F_hw in the feature-map dimension:

where the two coordinate pairs in the formula are, respectively, the abscissa and ordinate of the feature vector RepFeature_k in the feature map and the abscissa and ordinate of the feature vector F_hw in the feature map;

(302) Multiply dis(RF_k, F_hw) by the stride_i of each feature map to obtain the spatial correlation Corr_{Spatial_i}(RF_k, F_hw) of the two feature vectors, based on their distance on the original image; the calculation formula is:

Where Spatial_i means that RF_k and F_hw are taken from the i-th feature-map level, counted from the bottom up, of the feature pyramid network, and α is the scale parameter;

Therefore, the similarity expression in step 3 is:

Similarity(RF_k, F_hw) = Sim_{Embedded_Gaussian}(RF_k, F_hw) + Corr_{Spatial_i}(RF_k, F_hw)
The dense target detection method according to claim 4, wherein the final classification confidence of the difficult positive sample in step 4 is obtained by adding the representative confidence of category k, weighted by the similarity, to the confidence for category k at position (h, w) of the final prediction network's feature map; the calculation formula is:
The dense target detection method according to claim 3, wherein the method for measuring feature semantic similarity includes any one of Euclidean similarity, cosine similarity or Gaussian similarity.
The dense target detection method according to claim 1, wherein the feature extraction network uses a convolutional layer to reduce the size of the original image, and the extracted effective features are input to the feature pyramid network; the feature extraction network is a ResNet or HRNet convolutional neural network.
The dense target detection method according to claim 1, wherein the feature alignment module in the S²A-NET model is selected for the preliminary prediction network.
The dense target detection method according to claim 1, wherein the final prediction network selects the rotation detection module in the S²A-NET model.