CN114821331B - Remote sensing image weak supervision target detection method and system based on self-attention mechanism - Google Patents
- Publication number
- CN114821331B (application CN202210524417.2A)
- Authority
- CN
- China
- Prior art keywords
- module
- self
- image
- target
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10032—Satellite or aerial image; Remote sensing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism. The method comprises the following steps: acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image; inputting the training images, candidate frames and annotation information into an identification model for training, wherein the identification model comprises a candidate frame cluster learning module and a self-attention mechanism module, and the self-attention mechanism module comprises an encoder module and a decoder module; and inputting the image containing the target to be identified and its candidate frames into the trained identification model, which passes them sequentially through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result comprising the position, size and category of each target in the image. Because the identification model models the relations between candidate frames through a self-attention mechanism, richer information can be obtained, and a better detection result follows.
Description
Technical Field
The invention relates to the field of digital image processing, in particular to the technical fields of deep learning and weakly supervised target detection, and specifically to a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism.
Background
Target detection is the technique of finding objects of interest in an image from image features and determining their positions and categories. In the deep-learning era, target detection algorithms based on convolutional neural networks have developed rapidly, with representative algorithms such as Faster R-CNN, YOLOv3 and SSD emerging. Weakly supervised target detection trains the model using only image-level labels (i.e. the ground-truth label states only which categories of targets are present in the image, without their position information), yet can still give the positions and categories of the targets of interest at test time.
At present, remote sensing data are growing explosively, but the capacity to process and exploit remote sensing images has not kept pace with the data volume: the automation level of remote sensing image analysis remains relatively low, the extracted target information suffers from missed and false detections, and ever-higher practical requirements are difficult to meet. Relying on expert interpretation alone, it is hard to discover valuable information quickly and accurately from large numbers of visible-light remote sensing images with ocean backgrounds, and harder still to generate such information in real time; moreover, the images are large, so lossless transmission is time-consuming and unreliable. To address the difficulty of obtaining object-level labels, target detection algorithms based on weakly supervised learning have been developed.
On the one hand, when annotating manually, image-level annotation (left half of fig. 1) is far easier than object-level annotation (right half of fig. 1), so a training data set can be constructed much more efficiently. On the other hand, thanks to search engines, samples with specific image-level labels can easily be collected from the web, further reducing the workload of constructing a data set. Weakly supervised target detection methods are therefore more promising in practical applications.
The mainstream weakly supervised target detection pipeline first extracts a large number of candidate frames from the image with a candidate-frame extraction algorithm and computes a feature for each candidate frame. Multi-instance learning then scores the feature of each candidate frame independently for category and objectness; the highest-scoring candidate frames form the output of the weakly supervised detector.
However, existing algorithms process all candidate frame features separately: they neither consider the relations between candidate frame features (e.g. two candidate frames that each cover part of the same object will have similar features) nor exploit the information shared between candidate frames. In addition, the position and size of a candidate frame within the image are themselves evidence for whether it contains a target, and existing weakly supervised algorithms ignore this additional prior information.
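To make the per-frame scoring concrete, the following NumPy sketch shows one common instantiation of multi-instance scoring for weak supervision, a WSDDN-style two-stream head in which each candidate frame is scored independently and the per-class sums form the image-level prediction. This is an illustrative assumption of how such a pipeline is typically built, not the specific baseline used by this patent; all names (`mil_scores`, `W_cls`, `W_det`) are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_scores(feats, W_cls, W_det):
    """Two-stream MIL scoring: each candidate frame is scored independently.

    feats: (m, d) per-frame features. Returns (m, k) frame-class scores whose
    per-class sums give the image-level prediction trained against
    image-level labels only.
    """
    cls = softmax(feats @ W_cls, axis=1)  # (m, k): class competition within a frame
    det = softmax(feats @ W_det, axis=0)  # (m, k): frame competition within a class
    return cls * det                      # element-wise product of the two streams

rng = np.random.default_rng(1)
m, d, k = 20, 16, 4                       # frames, feature dim, classes
scores = mil_scores(rng.standard_normal((m, d)),
                    rng.standard_normal((d, k)),
                    rng.standard_normal((d, k)))
image_pred = scores.sum(axis=0)           # image-level class scores, each in (0, 1)
print(scores.shape, image_pred.shape)
```

Note that nothing in this scoring couples one frame's feature to another's, which is exactly the limitation the patent's self-attention branch addresses.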
Disclosure of Invention
The main purpose of the invention is to provide a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism, which model the relations between candidate frames through self-attention so as to provide global information for the selection of candidate-frame pseudo labels, alleviating the tendency of weakly supervised detectors on remote sensing images to be dominated by undersized prediction frames that cover only part of an object. The method also concatenates the position and size information of each candidate frame with its original features, so that richer information is obtained and a better detection result follows.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The embodiment of the invention provides a remote sensing image weak supervision target detection method based on a self-attention mechanism, which comprises the following steps:
S10, acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image.
Further, in the step S30, the encoding flow of the encoder module is as follows:
Extracting high-dimensional features F ∈ R^(d×1) and corresponding position-size encodings P ∈ R^(d×1) for the target candidate frames output by the candidate frame cluster learning module;
generating a feature map M ∈ R^(d×m) according to the dimension d of the high-dimensional features and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map: M_new = V · softmax(K^T Q / √d);
passing the new self-attention feature map M_new through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^(d×m).
Further, in the step S30, a decoding flow of the decoder module is as follows:
The input of the decoder module is a feature map H ∈ R^(d×n) formed by a set of learned query vectors q ∈ R^(d×1), together with the encoded candidate frame feature map M* obtained by the encoder module, wherein n is the number of query vectors;
mapping the inputs of the decoder module into Q′, K′ and V′ through different linear mapping layers to obtain the update of the H matrix: H_new = V′ · softmax(K′^T Q′ / √d);
passing the updated H_new through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^(d×n).
Further, in the step S30, outputting the identification result includes:
Predicting the position, size and category of the targets in the image by using the decoded query vector set H*, and outputting the prediction result.
Further, predicting the position, size and category of the targets in the image by using the decoded query vector set H* and outputting the prediction result comprises the following steps:
passing H* through a linear layer to obtain the category result cls ∈ R^(n×1) of the query vectors, wherein class 0 in the category result is set as the background class and classes 1 to k are the target classes of interest;
passing H* through a multi-layer perceptron to obtain the position-size result obj ∈ R^(n×4) of the query vectors, wherein the result obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^(1×4) of the i-th query vector gives the center-point coordinates on the x and y axes and the width and height of the prediction frame.
Further, in the step S20, the loss function used by the self-attention mechanism module part is:
L_trans = λ1 · L_bbox + λ2 · L_cls
wherein λ1 and λ2 are coefficients;
L_bbox = λ_L1 · ||b_pred − b_truth||_1
L_bbox is the L1 loss used for the candidate frame position information, b_pred is the predicted size-position information of the detection frame (its width, height and center-point position in the image), and b_truth is the size-position information of the pseudo-label detection frame generated by the algorithm;
L_cls is the Focal loss used for the category information, L_cls = −α_t (1 − p_t)^γ · log(p_t), where p_t equals c_pred for the true class and 1 − c_pred otherwise, α and γ are weight factors controlling the shape of the loss curve, c_pred is the predicted category score of the detection frame, and c_truth is the category information of the pseudo-label detection frame generated by the algorithm.
In a second aspect, an embodiment of the present invention further provides a remote sensing image weakly supervised target detection system based on a self-attention mechanism, including:
the acquisition module is used for acquiring the training image, the candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image.
Compared with the prior art, the invention has the following beneficial effects:
The remote sensing image weakly supervised target detection method based on the self-attention mechanism provided by the embodiment of the invention comprises: acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image; inputting the training images, candidate frames and annotation information into an identification model for training, wherein the identification model comprises a candidate frame cluster learning module and a self-attention mechanism module, and the self-attention mechanism module comprises an encoder module and a decoder module; and inputting the image containing the target to be identified and its candidate frames into the trained identification model, which passes them sequentially through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result comprising the position, size and category of each target in the image. Because the identification model models the relations between candidate frames through a self-attention mechanism and concatenates the position and size information of the candidate frames with their original features, richer information can be obtained and a better detection result follows.
Drawings
FIG. 1 is a schematic diagram of a prior art image containing a manual annotation;
Fig. 2 is a flowchart of a remote sensing image weak supervision target detection method based on a self-attention mechanism according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a basic identification model according to an embodiment of the present invention.
Detailed Description
The invention is further described in connection with the following detailed description, in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", "the other end", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific direction, be configured and operated in the specific direction, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "provided," "connected," and the like are to be construed broadly, and may be fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The remote sensing image weakly supervised target detection method based on the self-attention mechanism provided by the embodiment of the invention, as shown in fig. 2, comprises the following steps:
S10, acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image.
In this embodiment, taking the identification of aircraft carrier and amphibious assault ship targets as an example, the image in which targets are to be identified is first acquired and input into the identification model, and the output identification result is then obtained; wherein the identification model comprises a candidate frame cluster learning module and a self-attention mechanism module, and the self-attention mechanism module comprises an encoder module and a decoder module. The identification result may be that no aircraft carrier and/or amphibious assault ship target is present in the image, or the position, size and category of each identified aircraft carrier and/or amphibious assault ship.
The identification model in step S20 is built on the existing weakly supervised candidate frame cluster learning algorithm (Proposal Cluster Learning for Weakly Supervised Object Detection, PCL); it models the relations between candidate frames by adding an extra self-attention encoder-decoder branch and uses the size and position information of the candidate frames as additional prior knowledge, thereby improving the detection result.
The overall flow is as shown in fig. 3:
1. Encoder module section:
In fig. 3, the upper half is the framework of the original PCL algorithm and the lower half is the self-attention mechanism module, the improvement branch. The input of the improvement branch is F ∈ R^(d×1) and its position-size encoding P ∈ R^(d×1), where R denotes real space and F is the high-dimensional feature extracted from each candidate frame by the CNN and SPP layers and then reduced in dimension by a linear layer. The input feature of each candidate frame to the encoding layer is:
F* = F + P ∈ R^(d×1)
where d is the dimension of the reduced feature; in this embodiment d is set to 128. Compared with the 4096 dimensions before reduction, computing with the reduced feature greatly lowers the computational load and saves computation time. For an input image, if the candidate-frame extraction algorithm yields m candidate frames, the feature map input to the encoder layer is M ∈ R^(d×m).
M is input to the encoder layer, which encodes and learns the relational information between the high-dimensional features F of the candidate frames, yielding candidate frame features endowed with inter-frame relational semantics and position-size information. The encoder layer is structured as a self-attention mechanism layer, widely used in computer vision: it first maps M into Q, K and V through different linear mapping layers and then obtains a new self-attention feature map M_new = V · softmax(K^T Q / √d) ∈ R^(d×m).
Passing through three such self-attention mechanism layers in succession yields the encoded candidate frame feature map M* ∈ R^(d×m).
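The encoder stage above can be sketched in NumPy as follows: column-wise candidate-frame features (d×m), position-size encodings added in, then three scaled dot-product self-attention layers. This is a minimal sketch under the patent's stated shapes; the weight initialization, the absence of residual connections and layer normalization, and all variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(M, Wq, Wk, Wv):
    """One self-attention layer over the d x m candidate-frame feature map M.

    Q, K, V are linear maps of M; the update is V @ softmax(K^T Q / sqrt(d)),
    i.e. scaled dot-product attention with column-vector features.
    """
    d = M.shape[0]
    Q, K, V = Wq @ M, Wk @ M, Wv @ M                # each (d, m)
    attn = softmax(K.T @ Q / np.sqrt(d), axis=0)    # (m, m) attention weights
    return V @ attn                                 # (d, m) updated features

rng = np.random.default_rng(0)
d, m = 128, 10                                      # reduced feature dim, number of frames
F = rng.standard_normal((d, m))                     # reduced candidate-frame features
P = rng.standard_normal((d, m))                     # position-size encodings
M = F + P                                           # encoder input F* = F + P, per frame
for _ in range(3):                                  # three stacked layers, per the patent
    W = [rng.standard_normal((d, d)) * 0.01 for _ in range(3)]
    M = self_attention_layer(M, *W)
print(M.shape)  # encoded candidate frame feature map M*, (128, 10)
```

Each column of the (m, m) attention matrix mixes information from all m candidate frames, which is precisely the inter-frame relational modeling the branch is meant to add.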
2. Decoder module portion:
In fig. 3, the inputs of the decoder are a decoding feature map H ∈ R^(d×n), formed by concatenating along the second dimension a set of learned query vectors q ∈ R^(d×1) (object queries), and the encoded candidate frame feature map M* produced by the encoder, where n is the number of query vectors. The decoder has a similar structure to the encoder, but the input Q′ is linearly mapped from H while K′ and V′ are linearly mapped from M*. The decoding feature map is updated through a self-attention mechanism layer as H_new = V′ · softmax(K′^T Q′ / √d).
Passing through three such self-attention mechanism layers in succession likewise yields the decoded query vector feature map H* ∈ R^(d×n). Each query vector q can be regarded as a dynamic anchor frame that finds possible target positions in the image based on the information given by the encoder.
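The decoder step differs from the encoder only in where the queries come from, as this sketch shows: queries from the learned query-vector map H, keys and values from the encoded frame features M*. As before, this is a minimal sketch; weight scales and the omission of residual/normalization layers are assumptions.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_layer(H, M_star, Wq, Wk, Wv):
    """One decoder layer: queries from H (d x n), keys/values from M* (d x m)."""
    d = H.shape[0]
    Qp = Wq @ H                                       # (d, n) from the query vectors
    Kp, Vp = Wk @ M_star, Wv @ M_star                 # (d, m) from encoded frames
    attn = softmax(Kp.T @ Qp / np.sqrt(d), axis=0)    # (m, n): each query attends over frames
    return Vp @ attn                                  # (d, n) updated query features

rng = np.random.default_rng(2)
d, m, n = 128, 10, 5
H = rng.standard_normal((d, n))                       # n learned query vectors ("dynamic anchors")
M_star = rng.standard_normal((d, m))                  # encoder output
for _ in range(3):                                    # three layers, per the patent
    W = [rng.standard_normal((d, d)) * 0.01 for _ in range(3)]
    H = decoder_layer(H, M_star, *W)
print(H.shape)  # decoded query vector feature map H*, (128, 5)
```

Because the attention weights span all m candidate frames, each query vector can aggregate evidence from anywhere in the image rather than from a single frame.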
3. Prediction part:
The position, size and category of the targets in the image can be predicted from the decoded query vector set H*. H* is passed through a linear layer to obtain the category result cls ∈ R^(n×1) of the query vectors, with class 0 set as the background class and classes 1 to k the target classes of interest. Meanwhile, H* is passed through a multi-layer perceptron to obtain the position-size result obj ∈ R^(n×4), where obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^(1×4) for the i-th query vector gives the center-point coordinates on the x and y axes and the width and height of the prediction frame. Predicting with query vectors converts the multi-instance classification problem into a target-set prediction problem: the prediction is no longer limited to positions given by the candidate frames but attends to the global information of the whole image, so the feature-extraction network obtains more comprehensive information and its feature-extraction capability improves.
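The two prediction heads can be sketched as below: a single linear layer for the category result and a small MLP for the position-size result. The sigmoid squashing of boxes into (0, 1) relative coordinates and the class-score shape (n, k+1) are assumptions for illustration; the patent fixes only that cls has one category per query and obj ∈ R^(n×4).

```python
import numpy as np

def prediction_heads(H_star, W_cls, mlp):
    """Class head (linear layer) and box head (MLP) on decoded queries.

    H_star: (d, n). Returns per-query class scores (n, k+1), with class 0
    as background, and boxes (n, 4) as (cx, cy, w, h) squashed into (0, 1).
    """
    cls = (W_cls @ H_star).T                      # (n, k+1) raw class scores
    h = H_star
    for W in mlp[:-1]:
        h = np.maximum(0.0, W @ h)                # ReLU hidden layers of the MLP
    obj = 1.0 / (1.0 + np.exp(-(mlp[-1] @ h)))    # sigmoid -> (4, n) box parameters
    return cls, obj.T

rng = np.random.default_rng(3)
d, n, k = 128, 5, 4                               # feature dim, queries, foreground classes
H_star = rng.standard_normal((d, n))              # decoder output
W_cls = rng.standard_normal((k + 1, d)) * 0.1     # class head weights
mlp = [rng.standard_normal((d, d)) * 0.05,        # hidden layer
       rng.standard_normal((4, d)) * 0.05]        # output layer -> 4 box parameters
cls, obj = prediction_heads(H_star, W_cls, mlp)
print(cls.shape, obj.shape)  # (5, 5) (5, 4)
```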
4. Loss function:
The loss function used by the self-attention mechanism module measures the difference between the predicted result set and the pseudo-label truth detection frames generated by the PCL part: the candidate frame position information uses an L1 loss and the category information uses a Focal loss:
L_bbox = λ_L1 · ||b_pred − b_truth||_1
The two loss functions are combined to give the loss function of this part:
L_trans = λ1 · L_bbox + λ2 · L_cls
where λ1 and λ2 are coefficients, set to 0.4 and 1 respectively;
b_pred is the predicted size-position information of the detection frame (its width, height and center-point position in the image), and b_truth is the size-position information of the pseudo-label detection frame generated by the algorithm;
L_cls is the Focal loss used for the category information, L_cls = −α_t (1 − p_t)^γ · log(p_t), where p_t equals c_pred for the true class and 1 − c_pred otherwise, α and γ are weight factors controlling the shape of the loss curve, c_pred is the predicted category score of the detection frame, and c_truth is the category information of the pseudo-label detection frame generated by the algorithm.
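The combined loss can be sketched as follows. The L1 box term and the λ1 = 0.4, λ2 = 1 weighting follow the text; the standard focal-loss form and the α = 0.25, γ = 2 defaults are assumptions, since the patent does not state its α and γ values.

```python
import numpy as np

def bbox_loss(b_pred, b_truth, lam_l1=1.0):
    """L1 box loss: L_bbox = lam_l1 * ||b_pred - b_truth||_1, boxes as (cx, cy, w, h)."""
    return lam_l1 * np.abs(b_pred - b_truth).sum()

def focal_loss(c_pred, c_truth, alpha=0.25, gamma=2.0):
    """Standard focal loss on per-class probabilities.

    c_pred: predicted probabilities in (0, 1); c_truth: one-hot labels.
    alpha/gamma defaults are the common choices, assumed here.
    """
    p_t = np.where(c_truth == 1, c_pred, 1.0 - c_pred)   # prob assigned to the truth
    a_t = np.where(c_truth == 1, alpha, 1.0 - alpha)     # class-balancing weight
    return float(-(a_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-8, None))).sum())

def l_trans(b_pred, b_truth, c_pred, c_truth, lam1=0.4, lam2=1.0):
    """L_trans = lam1 * L_bbox + lam2 * L_cls, with lam1 = 0.4, lam2 = 1 per the text."""
    return lam1 * bbox_loss(b_pred, b_truth) + lam2 * focal_loss(c_pred, c_truth)

# Hypothetical query prediction vs. a PCL pseudo-label, in relative coordinates
b_pred = np.array([0.50, 0.50, 0.20, 0.10])
b_truth = np.array([0.48, 0.52, 0.25, 0.10])
c_pred = np.array([0.1, 0.8, 0.05, 0.05])   # probabilities over k+1 = 4 classes
c_truth = np.array([0, 1, 0, 0])            # pseudo-label says class 1
loss = l_trans(b_pred, b_truth, c_pred, c_truth)
print(loss)
```

The (1 − p_t)^γ factor down-weights already well-classified queries, which matters here because most of the n queries match the background class.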
With this loss function, the self-attention mechanism module acquires the ability to predict target categories; through its feedback, the feature-extraction part of the whole framework can be trained jointly with the PCL part, so the association information between candidate frames and their size and position information improve the detection performance of the algorithm and finally enable accurate target identification.
For example, target detection is performed on remote sensing images with the proposed self-attention-based weakly supervised method. The experiments use the public WorldView remote sensing ship image dataset. Tiling this dataset yields 14252 remote sensing ship images of size 1024×1024, containing 56539 target instances in 4 categories: aircraft carriers, amphibious assault ships, destroyers and other vessels. The evaluation indices are mAP and CorLoc: mAP is measured on the 1650-image test set, where a higher value means a better detection result; CorLoc is measured on the 12602-image training set, where a higher value means a better training result for the algorithm.
TABLE 1 mAP comparison of the proposed method (Ours) with the reference method PCL
Method | mAP@0.1 | mAP@0.3 | mAP@0.5
PCL (baseline) | 68.04 | 37.48 | 11.10
Ours | 70.14 | 38.70 | 11.55
TABLE 2 CorLoc comparison of the proposed method with the reference method PCL
As the CorLoc results on the training set show, the proposed method finds more accurate candidate frames during training, which improves the training of the model. Because the model exploits more accurate candidate frames during training, its detection index mAP is better than that of the reference algorithm PCL under every IoU threshold, i.e. the proposed method yields more accurate detection frames on the test set. The results in the two tables demonstrate the effectiveness of the proposed method and the application value of weakly supervised target detection algorithms in the remote sensing field.
According to the remote sensing image weak supervision target detection method based on the self-attention mechanism, the image to be identified is detected by the identification model. The identification model breaks through the limitation of existing algorithms that judge each candidate frame in isolation: the self-attention mechanism module extracts the relation information among candidate frames to improve the detection result, and the size and position information of the candidate frames is added as extra information to improve the accuracy of candidate frame selection. Furthermore, the constructed encoder and decoder predict with query vectors, converting the multi-instance classification problem into a prediction problem over a target set, so that the information of the whole image improves the training of the feature extraction network and facilitates accurate identification of the target.
Based on the same inventive concept, the embodiment of the invention also provides a remote sensing image weak supervision target detection system based on a self-attention mechanism, and the principle of the system for solving the problem is similar to that of a remote sensing image weak supervision target detection method based on the self-attention mechanism, so that the implementation of the system can be referred to the implementation of the method, and the repetition is omitted.
The remote sensing image weak supervision target detection system based on the self-attention mechanism provided by the embodiment of the invention comprises the following components:
the acquisition module is used for acquiring the training image, the candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image.
The foregoing has shown and described the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention; various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (5)
1. The remote sensing image weak supervision target detection method based on the self-attention mechanism is characterized by comprising the following steps of:
S10, acquiring a training image and candidate frames of the training image, and obtaining image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image;
In the step S30, the encoding flow of the encoder module is as follows:
extracting a high-dimensional feature F ∈ R^(d×1) and a corresponding position and size encoding P ∈ R^(d×1) for each target candidate frame output by the candidate frame cluster learning module, where R represents the real space;
generating a feature map M ∈ R^(d×m) according to the dimension d of the high-dimensional feature and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map: M_new = V · softmax(K^T Q / √d);
The new self-attention feature map M_new passes through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^(d×m);
in the step S30, the decoding process of the decoder module is as follows:
The input of the decoder module is a feature map H ∈ R^(d×n) formed by a group of learned query vectors q ∈ R^(d×1), together with the encoded candidate frame feature map M* obtained by the encoder module, where n is the number of query vectors;
mapping the input of the decoder module into Q', K' and V' through different linear mapping layers to obtain the update of the H matrix: H_new = V' · softmax(K'^T Q' / √d);
The updated H_new passes through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^(d×n).
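The encoder and decoder flows of claim 1 can be sketched as follows. This is a minimal single-head NumPy sketch under stated assumptions: weights are random and untrained, the position encoding is simply added to the features, and the scaled-dot-product form softmax(K^T Q / √d) is the standard attention formula, not necessarily the patent's exact expression.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, d):
    # Columns are feature vectors: Q ∈ R^(d×q), K, V ∈ R^(d×k)
    A = softmax(K.T @ Q / np.sqrt(d), axis=0)  # (k, q) attention weights
    return V @ A                               # (d, q) attended features

rng = np.random.default_rng(0)
d, m, n = 8, 5, 4        # feature dim, candidate frames, query vectors

# --- Encoder: three self-attention layers over the m candidate frames ---
F = rng.normal(size=(d, m))   # high-dimensional features F of the frames
P = rng.normal(size=(d, m))   # position/size encodings P
M = F + P                     # feature map M ∈ R^(d×m)
for _ in range(3):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    M = attention(Wq @ M, Wk @ M, Wv @ M, d)
M_star = M                    # encoded candidate frame feature map M*

# --- Decoder: n learned query vectors attend to the encoder output ---
H = rng.normal(size=(d, n))   # query feature map H ∈ R^(d×n)
for _ in range(3):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    H = attention(Wq @ H, Wk @ M_star, Wv @ M_star, d)
H_star = H                    # decoded query feature map H* ∈ R^(d×n)

print(M_star.shape, H_star.shape)
```

Note the design choice this illustrates: in the decoder the queries Q' come from the query vectors H while the keys K' and values V' come from the encoded candidate frames M*, so each query vector gathers information from all m candidate frames at once.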
2. The method for detecting a weakly supervised target of a remote sensing image based on a self attention mechanism as set forth in claim 1, wherein in the step S30, outputting the recognition result includes:
predicting the position, size and category of the target in the image by using the decoded query vector set H*, and outputting the prediction result.
3. The method for detecting the weakly supervised target of the remote sensing image based on the self attention mechanism as set forth in claim 2, wherein the position, the size and the category of the target in the image are predicted by using the decoded query vector set H*, and a prediction result is output, comprising the following steps:
H* obtains a class result cls ∈ R^(n×1) of the query vectors through a linear layer, wherein class 0 in the class result is set as the background class, and classes 1 to k are the target classes of interest;
H* passes through a multi-layer perceptron to obtain the position and size result obj ∈ R^(n×4) of the query vectors, wherein the result of the i-th query vector obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^(1×4) gives the center point coordinates of the prediction frame on the x and y axes and its width and height.
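The two prediction heads of claim 3 can be sketched as follows. A minimal NumPy sketch with random, untrained weights; the hidden width of the multi-layer perceptron and the sigmoid that keeps box values in [0, 1] are illustrative assumptions not fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 8, 4, 4                  # feature dim, query vectors, target classes
H_star = rng.normal(size=(d, n))   # stand-in for the decoded query feature map

# Class head: one linear layer mapping each query to k+1 logits
# (class 0 = background, classes 1..k = targets of interest)
W_cls = rng.normal(size=(k + 1, d))
cls = np.argmax(W_cls @ H_star, axis=0)   # (n,) predicted class per query

# Box head: a small MLP mapping each query to [cx, cy, w, h]
W1 = rng.normal(size=(16, d))
W2 = rng.normal(size=(4, 16))
hidden = np.maximum(W1 @ H_star, 0)       # ReLU hidden layer
obj = 1 / (1 + np.exp(-(W2 @ hidden)))    # sigmoid -> normalized coordinates
obj = obj.T                               # (n, 4): one box per query vector

print(cls.shape, obj.shape)               # (4,) (4, 4)
```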
4. The method of claim 3, wherein in the step S20, the loss function used by the self-attention mechanism module part is:
L_trans = λ_1 · L_bbox + λ_2 · L_cls
wherein λ_1 and λ_2 are weighting coefficients, respectively;
L_bbox = λ_L1 · ||b_pred − b_truth||_1
L_bbox represents the L1 loss function used for the candidate frame position information, b_pred represents the predicted size and position information of the detection frame, including its width, height and the position of its center point in the image, and b_truth represents the size and position information of the pseudo-label detection frame generated by the algorithm;
L_cls denotes the focal loss function used for the class information, L_cls = −α · (1 − c_pred)^γ · log(c_pred), where α and γ are weight factors controlling the shape of the loss curve, c_pred denotes the class score of the predicted detection frame, and c_truth denotes the class information of the pseudo-label detection frame generated by the algorithm.
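Under the assumptions that L_bbox is a plain L1 loss and L_cls takes the standard focal loss form (the patent's exact weighting and matching between predictions and pseudo labels may differ), the combined loss of claim 4 can be sketched as:

```python
import numpy as np

def l_bbox(b_pred, b_truth, lam_l1=1.0):
    # L1 loss between predicted box and pseudo-label box [cx, cy, w, h]
    return lam_l1 * np.abs(b_pred - b_truth).sum()

def l_cls(c_pred, alpha=0.25, gamma=2.0):
    # Standard focal loss for the score of the pseudo-label class:
    # (1 - c_pred)^gamma down-weights easy, well-classified examples
    return -alpha * (1 - c_pred) ** gamma * np.log(c_pred)

def l_trans(b_pred, b_truth, c_pred, lam1=1.0, lam2=1.0):
    # L_trans = lambda_1 * L_bbox + lambda_2 * L_cls
    return lam1 * l_bbox(b_pred, b_truth) + lam2 * l_cls(c_pred)

# Illustrative values: a prediction slightly off a pseudo-label box,
# with a confident (0.9) score for the pseudo-label class
b_pred = np.array([0.52, 0.48, 0.30, 0.20])
b_truth = np.array([0.50, 0.50, 0.32, 0.22])
print(round(l_trans(b_pred, b_truth, c_pred=0.9), 4))  # 0.0803
```

Because c_pred = 0.9 is already confident, the focal term contributes almost nothing here (≈ 0.0003) and the L1 box term (0.08) dominates, which is exactly the easy-example down-weighting that α and γ control.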
5. The remote sensing image weak supervision target detection system based on the self-attention mechanism is characterized by comprising the following components:
the acquisition module is used for acquiring the training image, the candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image;
The coding flow of the coder module is as follows:
extracting a high-dimensional feature F ∈ R^(d×1) and a corresponding position and size encoding P ∈ R^(d×1) for each target candidate frame output by the candidate frame cluster learning module, where R represents the real space;
generating a feature map M ∈ R^(d×m) according to the dimension d of the high-dimensional feature and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map: M_new = V · softmax(K^T Q / √d);
The new self-attention feature map M_new passes through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^(d×m);
the decoding flow of the decoder module is as follows:
The input of the decoder module is a feature map H ∈ R^(d×n) formed by a group of learned query vectors q ∈ R^(d×1), together with the encoded candidate frame feature map M* obtained by the encoder module, where n is the number of query vectors;
mapping the input of the decoder module into Q', K' and V' through different linear mapping layers to obtain the update of the H matrix: H_new = V' · softmax(K'^T Q' / √d);
The updated H_new passes through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^(d×n).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210524417.2A CN114821331B (en) | 2022-05-13 | 2022-05-13 | Remote sensing image weak supervision target detection method and system based on self-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114821331A CN114821331A (en) | 2022-07-29 |
CN114821331B true CN114821331B (en) | 2024-11-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |