CN114821331B - Remote sensing image weak supervision target detection method and system based on self-attention mechanism - Google Patents
- Publication number
- CN114821331B (application CN202210524417.2A)
- Authority
- CN
- China
- Prior art keywords
- module
- self
- image
- target
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10032—Satellite or aerial image; Remote sensing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism. The method comprises the following steps: acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image; inputting the training images, candidate frames and annotation information into an identification model for training, wherein the identification model comprises a candidate frame cluster learning module and a self-attention mechanism module, and the self-attention mechanism module comprises an encoder module and a decoder module; and inputting the image containing the target to be identified and its candidate frames into the trained identification model, which passes them sequentially through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result comprising the position, size and category of each target in the image. Because the identification model models the relations between candidate frames through a self-attention mechanism, richer information can be obtained, and a better detection result follows.
Description
Technical Field
The invention relates to the field of digital image processing, in particular to the technical fields of deep learning and weakly supervised target detection, and specifically to a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism.
Background
Target detection is the technique of finding objects of interest in an image from image features and determining their positions and categories. In the deep-learning era, target detection algorithms based on convolutional neural networks have developed rapidly, with representative algorithms such as Faster R-CNN, YOLOv3 and SSD emerging. Weakly supervised target detection trains the model using only image-level labels (i.e. the ground-truth label states only which categories of targets are present in the image, without their position information), yet can still give the positions and categories of the targets of interest at test time.
At present, remote sensing data are growing explosively, but the capacity to process and exploit remote sensing images has not kept pace with the data volume: the automation level of remote sensing image analysis remains relatively low, the extracted target information suffers from missed and false detections, and ever-higher practical requirements are difficult to meet. Relying on expert interpretation alone, it is hard to discover valuable information quickly and accurately from large numbers of visible-light remote sensing images with ocean backgrounds, and harder still to generate such information in real time; moreover, the images are large, so lossless transmission is time-consuming and unreliable. To address the difficulty of obtaining object-level labels, target detection algorithms based on weakly supervised learning have been developed.
On the one hand, when annotating manually, image-level annotation (left half of fig. 1) is far easier than object-level annotation (right half of fig. 1), so a training data set can be constructed much more efficiently. On the other hand, thanks to search engines, samples with specific image-level labels can easily be collected from the web, further reducing the workload of constructing a data set. Weakly supervised target detection methods are therefore more promising in practical applications.
The mainstream weakly supervised target detection pipeline first extracts a large number of candidate frames from the image with a candidate-frame extraction algorithm and computes a feature for each candidate frame. Multi-instance learning then scores the feature of each candidate frame independently for category and objectness; the highest-scoring candidate frames form the output of the weakly supervised detector.
However, existing algorithms process all candidate frame features separately: they neither consider the relations between candidate frame features (e.g. two candidate frames that each cover part of the same object will have similar features) nor exploit the information shared between candidate frames. In addition, the position and size of a candidate frame within the image are themselves evidence for whether it contains a target, and existing weakly supervised algorithms ignore this additional prior information.
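To make the per-frame scoring concrete, the following NumPy sketch shows one common instantiation of multi-instance scoring for weak supervision, a WSDDN-style two-stream head in which each candidate frame is scored independently and the per-class sums form the image-level prediction. This is an illustrative assumption of how such a pipeline is typically built, not the specific baseline used by this patent; all names (`mil_scores`, `W_cls`, `W_det`) are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_scores(feats, W_cls, W_det):
    """Two-stream MIL scoring: each candidate frame is scored independently.

    feats: (m, d) per-frame features. Returns (m, k) frame-class scores whose
    per-class sums give the image-level prediction trained against
    image-level labels only.
    """
    cls = softmax(feats @ W_cls, axis=1)  # (m, k): class competition within a frame
    det = softmax(feats @ W_det, axis=0)  # (m, k): frame competition within a class
    return cls * det                      # element-wise product of the two streams

rng = np.random.default_rng(1)
m, d, k = 20, 16, 4                       # frames, feature dim, classes
scores = mil_scores(rng.standard_normal((m, d)),
                    rng.standard_normal((d, k)),
                    rng.standard_normal((d, k)))
image_pred = scores.sum(axis=0)           # image-level class scores, each in (0, 1)
print(scores.shape, image_pred.shape)
```

Note that nothing in this scoring couples one frame's feature to another's, which is exactly the limitation the patent's self-attention branch addresses.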
Disclosure of Invention
The main purpose of the invention is to provide a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism, which model the relations between candidate frames through self-attention so as to provide global information for the selection of candidate-frame pseudo labels, alleviating the tendency of weakly supervised detectors on remote sensing images to be dominated by undersized prediction frames that cover only part of an object. The method also concatenates the position and size information of each candidate frame with its original features, so that richer information is obtained and a better detection result follows.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The embodiment of the invention provides a remote sensing image weak supervision target detection method based on a self-attention mechanism, which comprises the following steps:
S10, acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image.
Further, in the step S30, the encoding flow of the encoder module is as follows:
Extracting high-dimensional features F ∈ R^(d×1) and corresponding position-size encodings P ∈ R^(d×1) for the target candidate frames output by the candidate frame cluster learning module;
generating a feature map M ∈ R^(d×m) according to the dimension d of the high-dimensional features and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map: M_new = V · softmax(K^T Q / √d);
passing the new self-attention feature map M_new through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^(d×m).
Further, in the step S30, a decoding flow of the decoder module is as follows:
The input of the decoder module is a feature map H ∈ R^(d×n) formed by a set of learned query vectors q ∈ R^(d×1), together with the encoded candidate frame feature map M* obtained by the encoder module, wherein n is the number of query vectors;
mapping the inputs of the decoder module into Q′, K′ and V′ through different linear mapping layers to obtain the update of the H matrix: H_new = V′ · softmax(K′^T Q′ / √d);
passing the updated H_new through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^(d×n).
Further, in the step S30, outputting the identification result includes:
Predicting the position, size and category of the targets in the image by using the decoded query vector set H*, and outputting the prediction result.
Further, predicting the position, size and category of the targets in the image by using the decoded query vector set H* and outputting the prediction result comprises the following steps:
passing H* through a linear layer to obtain the category result cls ∈ R^(n×1) of the query vectors, wherein class 0 in the category result is set as the background class and classes 1 to k are the target classes of interest;
passing H* through a multi-layer perceptron to obtain the position-size result obj ∈ R^(n×4) of the query vectors, wherein the result obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^(1×4) of the i-th query vector gives the center-point coordinates on the x and y axes and the width and height of the prediction frame.
Further, in the step S20, the loss function used by the self-attention mechanism module part is:
L_trans = λ1 · L_bbox + λ2 · L_cls
wherein λ1 and λ2 are coefficients;
L_bbox = λ_L1 · ||b_pred − b_truth||_1
L_bbox is the L1 loss used for the candidate frame position information, b_pred is the predicted size-position information of the detection frame (its width, height and center-point position in the image), and b_truth is the size-position information of the pseudo-label detection frame generated by the algorithm;
L_cls is the Focal loss used for the category information, L_cls = −α_t (1 − p_t)^γ · log(p_t), where p_t equals c_pred for the true class and 1 − c_pred otherwise, α and γ are weight factors controlling the shape of the loss curve, c_pred is the predicted category score of the detection frame, and c_truth is the category information of the pseudo-label detection frame generated by the algorithm.
In a second aspect, an embodiment of the present invention further provides a remote sensing image weakly supervised target detection system based on a self-attention mechanism, including:
the acquisition module is used for acquiring the training image, the candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image.
Compared with the prior art, the invention has the following beneficial effects:
The remote sensing image weakly supervised target detection method based on the self-attention mechanism provided by the embodiment of the invention comprises: acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image; inputting the training images, candidate frames and annotation information into an identification model for training, wherein the identification model comprises a candidate frame cluster learning module and a self-attention mechanism module, and the self-attention mechanism module comprises an encoder module and a decoder module; and inputting the image containing the target to be identified and its candidate frames into the trained identification model, which passes them sequentially through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result comprising the position, size and category of each target in the image. Because the identification model models the relations between candidate frames through a self-attention mechanism and concatenates the position and size information of the candidate frames with their original features, richer information can be obtained and a better detection result follows.
Drawings
FIG. 1 is a schematic diagram of a prior art image containing a manual annotation;
Fig. 2 is a flowchart of a remote sensing image weak supervision target detection method based on a self-attention mechanism according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a basic identification model according to an embodiment of the present invention.
Detailed Description
The invention is further described in connection with the following detailed description, in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", "the other end", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific direction, be configured and operated in the specific direction, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "provided," "connected," and the like are to be construed broadly, and may be fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The remote sensing image weakly supervised target detection method based on the self-attention mechanism provided by the embodiment of the invention, as shown in fig. 2, comprises the following steps:
S10, acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image.
In this embodiment, taking the identification of aircraft carrier and amphibious assault ship targets as an example, the image in which targets are to be identified is first acquired and input into the identification model, and the output identification result is then obtained; wherein the identification model comprises a candidate frame cluster learning module and a self-attention mechanism module, and the self-attention mechanism module comprises an encoder module and a decoder module. The identification result may be that no aircraft carrier and/or amphibious assault ship target is present in the image, or the position, size and category of each identified aircraft carrier and/or amphibious assault ship.
The identification model in step S20 is built on the existing weakly supervised candidate frame cluster learning algorithm (Proposal Cluster Learning for Weakly Supervised Object Detection, PCL); it models the relations between candidate frames by adding an extra self-attention encoder-decoder branch and uses the size and position information of the candidate frames as additional prior knowledge, thereby improving the detection result.
The overall flow is as shown in fig. 3:
1. Encoder module section:
In fig. 3, the upper half is the framework of the original PCL algorithm and the lower half is the self-attention mechanism module, the improvement branch. The input of the improvement branch is F ∈ R^(d×1) and its position-size encoding P ∈ R^(d×1), where R denotes real space and F is the high-dimensional feature extracted from each candidate frame by the CNN and SPP layers and then reduced in dimension by a linear layer. The input feature of each candidate frame to the encoding layer is:
F* = F + P ∈ R^(d×1)
where d is the dimension of the reduced feature; in this embodiment d is set to 128. Compared with the 4096 dimensions before reduction, computing with the reduced feature greatly lowers the computational load and saves computation time. For an input image, if the candidate-frame extraction algorithm yields m candidate frames, the feature map input to the encoder layer is M ∈ R^(d×m).
M is input to the encoder layer, which encodes and learns the relational information between the high-dimensional features F of the candidate frames, yielding candidate frame features endowed with inter-frame relational semantics and position-size information. The encoder layer is structured as a self-attention mechanism layer, widely used in computer vision: it first maps M into Q, K and V through different linear mapping layers and then obtains a new self-attention feature map M_new = V · softmax(K^T Q / √d) ∈ R^(d×m).
Passing through three such self-attention mechanism layers in succession yields the encoded candidate frame feature map M* ∈ R^(d×m).
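The encoder stage above can be sketched in NumPy as follows: column-wise candidate-frame features (d×m), position-size encodings added in, then three scaled dot-product self-attention layers. This is a minimal sketch under the patent's stated shapes; the weight initialization, the absence of residual connections and layer normalization, and all variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(M, Wq, Wk, Wv):
    """One self-attention layer over the d x m candidate-frame feature map M.

    Q, K, V are linear maps of M; the update is V @ softmax(K^T Q / sqrt(d)),
    i.e. scaled dot-product attention with column-vector features.
    """
    d = M.shape[0]
    Q, K, V = Wq @ M, Wk @ M, Wv @ M                # each (d, m)
    attn = softmax(K.T @ Q / np.sqrt(d), axis=0)    # (m, m) attention weights
    return V @ attn                                 # (d, m) updated features

rng = np.random.default_rng(0)
d, m = 128, 10                                      # reduced feature dim, number of frames
F = rng.standard_normal((d, m))                     # reduced candidate-frame features
P = rng.standard_normal((d, m))                     # position-size encodings
M = F + P                                           # encoder input F* = F + P, per frame
for _ in range(3):                                  # three stacked layers, per the patent
    W = [rng.standard_normal((d, d)) * 0.01 for _ in range(3)]
    M = self_attention_layer(M, *W)
print(M.shape)  # encoded candidate frame feature map M*, (128, 10)
```

Each column of the (m, m) attention matrix mixes information from all m candidate frames, which is precisely the inter-frame relational modeling the branch is meant to add.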
2. Decoder module portion:
In fig. 3, the inputs of the decoder are a decoding feature map H ∈ R^(d×n), formed by concatenating along the second dimension a set of learned query vectors q ∈ R^(d×1) (object queries), and the encoded candidate frame feature map M* produced by the encoder, where n is the number of query vectors. The decoder has a similar structure to the encoder, but the input Q′ is linearly mapped from H while K′ and V′ are linearly mapped from M*. The decoding feature map is updated through a self-attention mechanism layer as H_new = V′ · softmax(K′^T Q′ / √d).
Passing through three such self-attention mechanism layers in succession likewise yields the decoded query vector feature map H* ∈ R^(d×n). Each query vector q can be regarded as a dynamic anchor frame that finds possible target positions in the image based on the information given by the encoder.
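The decoder step differs from the encoder only in where the queries come from, as this sketch shows: queries from the learned query-vector map H, keys and values from the encoded frame features M*. As before, this is a minimal sketch; weight scales and the omission of residual/normalization layers are assumptions.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_layer(H, M_star, Wq, Wk, Wv):
    """One decoder layer: queries from H (d x n), keys/values from M* (d x m)."""
    d = H.shape[0]
    Qp = Wq @ H                                       # (d, n) from the query vectors
    Kp, Vp = Wk @ M_star, Wv @ M_star                 # (d, m) from encoded frames
    attn = softmax(Kp.T @ Qp / np.sqrt(d), axis=0)    # (m, n): each query attends over frames
    return Vp @ attn                                  # (d, n) updated query features

rng = np.random.default_rng(2)
d, m, n = 128, 10, 5
H = rng.standard_normal((d, n))                       # n learned query vectors ("dynamic anchors")
M_star = rng.standard_normal((d, m))                  # encoder output
for _ in range(3):                                    # three layers, per the patent
    W = [rng.standard_normal((d, d)) * 0.01 for _ in range(3)]
    H = decoder_layer(H, M_star, *W)
print(H.shape)  # decoded query vector feature map H*, (128, 5)
```

Because the attention weights span all m candidate frames, each query vector can aggregate evidence from anywhere in the image rather than from a single frame.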
3. Prediction part:
The position, size and category of the targets in the image can be predicted from the decoded query vector set H*. H* is passed through a linear layer to obtain the category result cls ∈ R^(n×1) of the query vectors, with class 0 set as the background class and classes 1 to k the target classes of interest. Meanwhile, H* is passed through a multi-layer perceptron to obtain the position-size result obj ∈ R^(n×4), where obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^(1×4) for the i-th query vector gives the center-point coordinates on the x and y axes and the width and height of the prediction frame. Predicting with query vectors converts the multi-instance classification problem into a target-set prediction problem: the prediction is no longer limited to positions given by the candidate frames but attends to the global information of the whole image, so the feature-extraction network obtains more comprehensive information and its feature-extraction capability improves.
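The two prediction heads can be sketched as below: a single linear layer for the category result and a small MLP for the position-size result. The sigmoid squashing of boxes into (0, 1) relative coordinates and the class-score shape (n, k+1) are assumptions for illustration; the patent fixes only that cls has one category per query and obj ∈ R^(n×4).

```python
import numpy as np

def prediction_heads(H_star, W_cls, mlp):
    """Class head (linear layer) and box head (MLP) on decoded queries.

    H_star: (d, n). Returns per-query class scores (n, k+1), with class 0
    as background, and boxes (n, 4) as (cx, cy, w, h) squashed into (0, 1).
    """
    cls = (W_cls @ H_star).T                      # (n, k+1) raw class scores
    h = H_star
    for W in mlp[:-1]:
        h = np.maximum(0.0, W @ h)                # ReLU hidden layers of the MLP
    obj = 1.0 / (1.0 + np.exp(-(mlp[-1] @ h)))    # sigmoid -> (4, n) box parameters
    return cls, obj.T

rng = np.random.default_rng(3)
d, n, k = 128, 5, 4                               # feature dim, queries, foreground classes
H_star = rng.standard_normal((d, n))              # decoder output
W_cls = rng.standard_normal((k + 1, d)) * 0.1     # class head weights
mlp = [rng.standard_normal((d, d)) * 0.05,        # hidden layer
       rng.standard_normal((4, d)) * 0.05]        # output layer -> 4 box parameters
cls, obj = prediction_heads(H_star, W_cls, mlp)
print(cls.shape, obj.shape)  # (5, 5) (5, 4)
```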
4. Loss function:
The loss function used by the self-attention mechanism module measures the difference between the predicted result set and the pseudo-label truth detection frames generated by the PCL part: the candidate frame position information uses an L1 loss and the category information uses a Focal loss:
L_bbox = λ_L1 · ||b_pred − b_truth||_1
The two loss functions are combined to give the loss function of this part:
L_trans = λ1 · L_bbox + λ2 · L_cls
where λ1 and λ2 are coefficients, set to 0.4 and 1 respectively;
b_pred is the predicted size-position information of the detection frame (its width, height and center-point position in the image), and b_truth is the size-position information of the pseudo-label detection frame generated by the algorithm;
L_cls is the Focal loss used for the category information, L_cls = −α_t (1 − p_t)^γ · log(p_t), where p_t equals c_pred for the true class and 1 − c_pred otherwise, α and γ are weight factors controlling the shape of the loss curve, c_pred is the predicted category score of the detection frame, and c_truth is the category information of the pseudo-label detection frame generated by the algorithm.
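The combined loss can be sketched as follows. The L1 box term and the λ1 = 0.4, λ2 = 1 weighting follow the text; the standard focal-loss form and the α = 0.25, γ = 2 defaults are assumptions, since the patent does not state its α and γ values.

```python
import numpy as np

def bbox_loss(b_pred, b_truth, lam_l1=1.0):
    """L1 box loss: L_bbox = lam_l1 * ||b_pred - b_truth||_1, boxes as (cx, cy, w, h)."""
    return lam_l1 * np.abs(b_pred - b_truth).sum()

def focal_loss(c_pred, c_truth, alpha=0.25, gamma=2.0):
    """Standard focal loss on per-class probabilities.

    c_pred: predicted probabilities in (0, 1); c_truth: one-hot labels.
    alpha/gamma defaults are the common choices, assumed here.
    """
    p_t = np.where(c_truth == 1, c_pred, 1.0 - c_pred)   # prob assigned to the truth
    a_t = np.where(c_truth == 1, alpha, 1.0 - alpha)     # class-balancing weight
    return float(-(a_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-8, None))).sum())

def l_trans(b_pred, b_truth, c_pred, c_truth, lam1=0.4, lam2=1.0):
    """L_trans = lam1 * L_bbox + lam2 * L_cls, with lam1 = 0.4, lam2 = 1 per the text."""
    return lam1 * bbox_loss(b_pred, b_truth) + lam2 * focal_loss(c_pred, c_truth)

# Hypothetical query prediction vs. a PCL pseudo-label, in relative coordinates
b_pred = np.array([0.50, 0.50, 0.20, 0.10])
b_truth = np.array([0.48, 0.52, 0.25, 0.10])
c_pred = np.array([0.1, 0.8, 0.05, 0.05])   # probabilities over k+1 = 4 classes
c_truth = np.array([0, 1, 0, 0])            # pseudo-label says class 1
loss = l_trans(b_pred, b_truth, c_pred, c_truth)
print(loss)
```

The (1 − p_t)^γ factor down-weights already well-classified queries, which matters here because most of the n queries match the background class.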
With this loss function, the self-attention mechanism module acquires the ability to predict target categories; through its feedback, the feature-extraction part of the whole framework can be trained jointly with the PCL part, so the association information between candidate frames and their size and position information improve the detection performance of the algorithm and finally enable accurate target identification.
For example, target detection is performed on remote sensing images with the proposed self-attention-based weakly supervised method. The experiments use the public WorldView remote sensing ship image dataset. Tiling this dataset yields 14252 remote sensing ship images of size 1024×1024, containing 56539 target instances in 4 categories: aircraft carriers, amphibious assault ships, destroyers and other vessels. The evaluation indices are mAP and CorLoc: mAP is measured on the 1650-image test set, where a higher value means a better detection result; CorLoc is measured on the 12602-image training set, where a higher value means a better training result for the algorithm.
TABLE 1 mAP comparison of the proposed method (Ours) with the reference method PCL
Method | mAP@0.1 | mAP@0.3 | mAP@0.5
PCL (baseline) | 68.04 | 37.48 | 11.10
Ours | 70.14 | 38.70 | 11.55
TABLE 2 CorLoc comparison of the proposed method with the reference method PCL
As the CorLoc results on the training set show, the proposed method finds more accurate candidate frames during training, which improves the training of the model. Because the model exploits more accurate candidate frames during training, its detection index mAP is better than that of the reference algorithm PCL under every IoU threshold, i.e. the proposed method yields more accurate detection frames on the test set. The results in the two tables demonstrate the effectiveness of the proposed method and the application value of weakly supervised target detection algorithms in the remote sensing field.
According to the remote sensing image weak supervision target detection method based on the self-attention mechanism, the image to be identified is detected by the identification model. The identification model breaks through the limitation of existing algorithms that judge each candidate frame in isolation: the self-attention mechanism module extracts the relation information among candidate frames to improve the detection result, and the size and position information of the candidate frames is added as extra information to improve the accuracy of candidate frame selection. Furthermore, the constructed encoder and decoder predict with query vectors, converting the multi-instance classification problem into a prediction problem over a target set, so that the information of the whole image improves the training of the feature extraction network and facilitates accurate identification of the target.
Based on the same inventive concept, the embodiment of the invention also provides a remote sensing image weak supervision target detection system based on a self-attention mechanism, and the principle of the system for solving the problem is similar to that of a remote sensing image weak supervision target detection method based on the self-attention mechanism, so that the implementation of the system can be referred to the implementation of the method, and the repetition is omitted.
The remote sensing image weak supervision target detection system based on the self-attention mechanism provided by the embodiment of the invention comprises the following components:
the acquisition module is used for acquiring the training image, the candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image.
The foregoing has shown and described the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention; various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (5)
1. The remote sensing image weak supervision target detection method based on the self-attention mechanism is characterized by comprising the following steps of:
S10, acquiring a training image and candidate frames of the training image, and obtaining image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image;
In the step S30, the encoding flow of the encoder module is as follows:
extracting a high-dimensional feature F ∈ R^(d×1) and a corresponding position and size encoding P ∈ R^(d×1) for each target candidate frame output by the candidate frame cluster learning module, where R represents the real space;
generating a feature map M ∈ R^(d×m) according to the dimension d of the high-dimensional feature and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map: M_new = V · softmax(K^T Q / √d);
The new self-attention feature map M_new passes through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^(d×m);
in the step S30, the decoding process of the decoder module is as follows:
The input of the decoder module is a feature map H ∈ R^(d×n) formed by a group of learned query vectors q ∈ R^(d×1), together with the encoded candidate frame feature map M* obtained by the encoder module, where n is the number of query vectors;
mapping the input of the decoder module into Q', K' and V' through different linear mapping layers to obtain the update of the H matrix: H_new = V' · softmax(K'^T Q' / √d);
The updated H_new passes through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^(d×n).
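The encoder and decoder flows of claim 1 can be sketched as follows. This is a minimal single-head NumPy sketch under stated assumptions: weights are random and untrained, the position encoding is simply added to the features, and the scaled-dot-product form softmax(K^T Q / √d) is the standard attention formula, not necessarily the patent's exact expression.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, d):
    # Columns are feature vectors: Q ∈ R^(d×q), K, V ∈ R^(d×k)
    A = softmax(K.T @ Q / np.sqrt(d), axis=0)  # (k, q) attention weights
    return V @ A                               # (d, q) attended features

rng = np.random.default_rng(0)
d, m, n = 8, 5, 4        # feature dim, candidate frames, query vectors

# --- Encoder: three self-attention layers over the m candidate frames ---
F = rng.normal(size=(d, m))   # high-dimensional features F of the frames
P = rng.normal(size=(d, m))   # position/size encodings P
M = F + P                     # feature map M ∈ R^(d×m)
for _ in range(3):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    M = attention(Wq @ M, Wk @ M, Wv @ M, d)
M_star = M                    # encoded candidate frame feature map M*

# --- Decoder: n learned query vectors attend to the encoder output ---
H = rng.normal(size=(d, n))   # query feature map H ∈ R^(d×n)
for _ in range(3):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    H = attention(Wq @ H, Wk @ M_star, Wv @ M_star, d)
H_star = H                    # decoded query feature map H* ∈ R^(d×n)

print(M_star.shape, H_star.shape)
```

Note the design choice this illustrates: in the decoder the queries Q' come from the query vectors H while the keys K' and values V' come from the encoded candidate frames M*, so each query vector gathers information from all m candidate frames at once.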
2. The method for detecting a weakly supervised target of a remote sensing image based on a self attention mechanism as set forth in claim 1, wherein in the step S30, outputting the recognition result includes:
predicting the position, size and category of the target in the image by using the decoded query vector set H*, and outputting the prediction result.
3. The method for detecting the weakly supervised target of the remote sensing image based on the self attention mechanism as set forth in claim 2, wherein the position, the size and the category of the target in the image are predicted by using the decoded query vector set H*, and a prediction result is output, comprising the following steps:
H* obtains a class result cls ∈ R^(n×1) of the query vectors through a linear layer, wherein class 0 in the class result is set as the background class, and classes 1 to k are the target classes of interest;
H* passes through a multi-layer perceptron to obtain the position and size result obj ∈ R^(n×4) of the query vectors, wherein the result of the i-th query vector obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^(1×4) gives the center point coordinates of the prediction frame on the x and y axes and its width and height.
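The two prediction heads of claim 3 can be sketched as follows. A minimal NumPy sketch with random, untrained weights; the hidden width of the multi-layer perceptron and the sigmoid that keeps box values in [0, 1] are illustrative assumptions not fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 8, 4, 4                  # feature dim, query vectors, target classes
H_star = rng.normal(size=(d, n))   # stand-in for the decoded query feature map

# Class head: one linear layer mapping each query to k+1 logits
# (class 0 = background, classes 1..k = targets of interest)
W_cls = rng.normal(size=(k + 1, d))
cls = np.argmax(W_cls @ H_star, axis=0)   # (n,) predicted class per query

# Box head: a small MLP mapping each query to [cx, cy, w, h]
W1 = rng.normal(size=(16, d))
W2 = rng.normal(size=(4, 16))
hidden = np.maximum(W1 @ H_star, 0)       # ReLU hidden layer
obj = 1 / (1 + np.exp(-(W2 @ hidden)))    # sigmoid -> normalized coordinates
obj = obj.T                               # (n, 4): one box per query vector

print(cls.shape, obj.shape)               # (4,) (4, 4)
```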
4. The method of claim 3, wherein in the step S20, the loss function used by the self-attention mechanism module part is:
L_trans = λ_1 · L_bbox + λ_2 · L_cls
wherein λ_1 and λ_2 are weighting coefficients, respectively;
L_bbox = λ_L1 · ||b_pred − b_truth||_1
L_bbox represents the L1 loss function used for the candidate frame position information, b_pred represents the predicted size and position information of the detection frame, including its width, height and the position of its center point in the image, and b_truth represents the size and position information of the pseudo-label detection frame generated by the algorithm;
L_cls denotes the focal loss function used for the class information, L_cls = −α · (1 − c_pred)^γ · log(c_pred), where α and γ are weight factors controlling the shape of the loss curve, c_pred denotes the class score of the predicted detection frame, and c_truth denotes the class information of the pseudo-label detection frame generated by the algorithm.
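Under the assumptions that L_bbox is a plain L1 loss and L_cls takes the standard focal loss form (the patent's exact weighting and matching between predictions and pseudo labels may differ), the combined loss of claim 4 can be sketched as:

```python
import numpy as np

def l_bbox(b_pred, b_truth, lam_l1=1.0):
    # L1 loss between predicted box and pseudo-label box [cx, cy, w, h]
    return lam_l1 * np.abs(b_pred - b_truth).sum()

def l_cls(c_pred, alpha=0.25, gamma=2.0):
    # Standard focal loss for the score of the pseudo-label class:
    # (1 - c_pred)^gamma down-weights easy, well-classified examples
    return -alpha * (1 - c_pred) ** gamma * np.log(c_pred)

def l_trans(b_pred, b_truth, c_pred, lam1=1.0, lam2=1.0):
    # L_trans = lambda_1 * L_bbox + lambda_2 * L_cls
    return lam1 * l_bbox(b_pred, b_truth) + lam2 * l_cls(c_pred)

# Illustrative values: a prediction slightly off a pseudo-label box,
# with a confident (0.9) score for the pseudo-label class
b_pred = np.array([0.52, 0.48, 0.30, 0.20])
b_truth = np.array([0.50, 0.50, 0.32, 0.22])
print(round(l_trans(b_pred, b_truth, c_pred=0.9), 4))  # 0.0803
```

Because c_pred = 0.9 is already confident, the focal term contributes almost nothing here (≈ 0.0003) and the L1 box term (0.08) dominates, which is exactly the easy-example down-weighting that α and γ control.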
5. The remote sensing image weak supervision target detection system based on the self-attention mechanism is characterized by comprising the following components:
the acquisition module is used for acquiring the training image, the candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image;
The coding flow of the coder module is as follows:
extracting a high-dimensional feature F ∈ R^(d×1) and a corresponding position and size encoding P ∈ R^(d×1) for each target candidate frame output by the candidate frame cluster learning module, where R represents the real space;
generating a feature map M ∈ R^(d×m) according to the dimension d of the high-dimensional feature and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map: M_new = V · softmax(K^T Q / √d);
The new self-attention feature map M_new passes through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^(d×m);
the decoding flow of the decoder module is as follows:
The input of the decoder module is a feature map H ∈ R^(d×n) formed by a group of learned query vectors q ∈ R^(d×1), together with the encoded candidate frame feature map M* obtained by the encoder module, where n is the number of query vectors;
mapping the input of the decoder module into Q', K' and V' through different linear mapping layers to obtain the update of the H matrix: H_new = V' · softmax(K'^T Q' / √d);
The updated H_new passes through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^(d×n).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210524417.2A CN114821331B (en) | 2022-05-13 | 2022-05-13 | Remote sensing image weak supervision target detection method and system based on self-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114821331A CN114821331A (en) | 2022-07-29 |
CN114821331B true CN114821331B (en) | 2024-11-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |