
CN114821331B - Remote sensing image weak supervision target detection method and system based on self-attention mechanism - Google Patents


Info

Publication number
CN114821331B
CN114821331B (application CN202210524417.2A)
Authority
CN
China
Prior art keywords
module, self-attention mechanism, image, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210524417.2A
Other languages
Chinese (zh)
Other versions
CN114821331A (en)
Inventor
张浩鹏
谭智文
姜志国
谢凤英
赵丹培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202210524417.2A
Publication of CN114821331A
Application granted
Publication of CN114821331B
Legal status: Active

Classifications

    • G06V 20/10 Terrestrial scenes (under G06V 20/00 Scenes; scene-specific elements)
    • G06N 3/045 Combinations of networks (under G06N 3/04 Neural network architectures)
    • G06N 3/08 Learning methods (under G06N 3/02 Neural networks)
    • G06T 7/60 Analysis of geometric attributes (under G06T 7/00 Image analysis)
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V 10/764 Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Image or video recognition using neural networks
    • G06T 2207/10032 Satellite or aerial image; remote sensing
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism. The method comprises the following steps: acquiring a training image and its candidate frames, and acquiring the image-level annotation information corresponding to the training image; inputting the training image, candidate frames and annotation information into a recognition model for training, the recognition model comprising a candidate frame cluster learning module and a self-attention mechanism module, and the self-attention mechanism module comprising an encoder module and a decoder module; and inputting an image of the target to be identified and its candidate frames into the trained recognition model, which outputs a recognition result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence, the recognition result comprising the position, size and category of the corresponding targets in the image. Because the recognition model models the relations between candidate frames through a self-attention mechanism, richer information can be obtained, and a better detection result is therefore achieved.

Description

Remote sensing image weak supervision target detection method and system based on self-attention mechanism
Technical Field
The invention relates to the field of digital image processing, in particular to the technical fields of deep learning and weakly supervised target detection, and specifically to a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism.
Background
Target detection is the technique of finding objects of interest in an image from image features and determining their positions and categories. In the era of deep learning, target detection algorithms based on convolutional neural networks have developed rapidly, and representative algorithms such as Faster R-CNN, YOLOv3 and SSD have emerged. Weakly supervised target detection trains the algorithm model with only image-level labels (i.e., the truth label only states which types of targets exist in the image to be detected and does not give their specific positions), yet can still give the positions and categories of the targets of interest at test time.
At present, remote sensing data is growing explosively, but the capacity to process and exploit remote sensing images has not grown in step with the data volume: the level of automation in remote sensing image analysis remains relatively low, the extracted target information suffers from missed and false detections, and it is difficult to meet ever-higher practical demands. Relying on expert interpretation alone, it is difficult to discover valuable information quickly and accurately from a large volume of visible-light remote sensing images with ocean backgrounds, let alone to generate such information in real time; moreover, the image data volume is large, and lossless transmission is time-consuming and unreliable. To address the difficulty of obtaining object-level labels, target detection algorithms based on weakly supervised learning have been developed.
On the one hand, when annotating manually, image-level annotation (left half of fig. 1) is far easier than object-level annotation (right half of fig. 1), so a training data set can be constructed more efficiently. On the other hand, thanks to search engines, samples with specific image-level labels can easily be collected from the web, further reducing the workload of constructing a data set. Therefore, in practical applications, target detection methods based on weakly supervised learning are more promising.
The existing mainstream weakly supervised target detection technique first extracts a large number of candidate frames from an image with a candidate frame extraction algorithm and extracts a feature for each candidate frame. Multi-instance learning is then used to score the features of each candidate frame independently for category and target likelihood; the high-scoring candidate frames are the output of the weakly supervised detector.
However, existing algorithms process all candidate frame features separately and do not take into account the relations between them (e.g., two candidate frames covering parts of the same object will have similar features), leaving the information between candidate frames unused. In addition, the position and size of a candidate frame within the image are also evidence for judging whether it contains a target, and existing weakly supervised algorithms do not consider the additional prior information this provides.
Disclosure of Invention
The main purpose of the invention is to provide a remote sensing image weakly supervised target detection method and system based on a self-attention mechanism, which model the relations between candidate frames through the self-attention mechanism so as to provide global information for the selection of candidate frame pseudo labels, and which solve the problem that weakly supervised target detection algorithms applied to remote sensing images are dominated by small prediction frames. The method splices the position and size information of each candidate frame with its original features, so that richer information can be obtained and a better detection result achieved.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The embodiment of the invention provides a remote sensing image weak supervision target detection method based on a self-attention mechanism, which comprises the following steps:
S10, acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image.
Further, in the step S30, the encoding flow of the encoder module is as follows:
Extracting, for the target candidate frames output by the candidate frame cluster learning module, high-dimensional features F ∈ R^{d×1} and corresponding position-and-size codes P ∈ R^{d×1};
generating a feature map M ∈ R^{d×m} according to the dimension d of the high-dimensional features and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map (in the standard scaled dot-product form):
M_new = V · softmax(KᵀQ / √d) ∈ R^{d×m};
the new self-attention feature map M_new passes through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^{d×m}.
Further, in the step S30, the decoding flow of the decoder module is as follows:
the input of the decoder module is a feature map H ∈ R^{d×n} formed by a group of learned query vectors q ∈ R^{d×1}, together with the encoded candidate frame feature map M* obtained by the encoder module, where n is the number of query vectors;
mapping the input of the decoder module into Q′, K′ and V′ through different linear mapping layers to obtain the update of the H matrix (with Q′ mapped from H, and K′ and V′ mapped from M*):
H_new = V′ · softmax(K′ᵀQ′ / √d) ∈ R^{d×n};
the updated H_new passes through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^{d×n}.
Further, in the step S30, outputting the identification result includes:
And predicting the position, size and category of the target in the image by using the decoded query vector set H*, and outputting a prediction result.
Further, predicting the position, size and category of the target in the image by using the decoded query vector set H* and outputting a prediction result comprises the following steps:
H* obtains a category result cls ∈ R^{n×1} of the query vectors through a linear layer, wherein category 0 in the category result is set as the background category, and categories 1 to k are the target categories of interest;
H* passes through a multi-layer perceptron to obtain the position-size result obj ∈ R^{n×4} of the query vectors, wherein the position-size result of the i-th query vector, obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^{1×4}, gives the center-point coordinates of the prediction frame on the x and y axes and its length and width.
Further, in the step S20, the loss function used by the self-attention mechanism module part is:
L_trans = λ_1 L_bbox + λ_2 L_cls
wherein λ_1 and λ_2 are coefficients;
L_bbox = λ_L1 ||b_pred − b_truth||_1
L_bbox denotes the L1 loss function used for the candidate frame position information, b_pred denotes the predicted size and position information of the detection frame, including its length, width and center-point position in the image, and b_truth denotes the size and position information of the detection frame of the pseudo label generated by the algorithm;
L_cls denotes the focal loss function used for the category information, whose standard form is L_cls = −α(1 − p_t)^γ log(p_t) with p_t the predicted score c_pred assigned to the true category c_truth; α and γ are weight factors controlling the shape of the loss curve, c_pred denotes the category score of the predicted detection frame, and c_truth denotes the category information of the detection frame of the pseudo label generated by the algorithm.
In a second aspect, an embodiment of the present invention further provides a remote sensing image weakly supervised target detection system based on a self-attention mechanism, including:
The acquisition module is used for acquiring the training image, its candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image.
Compared with the prior art, the invention has the following beneficial effects:
The remote sensing image weakly supervised target detection method based on the self-attention mechanism provided by the embodiment of the invention comprises: acquiring a training image and its candidate frames, and acquiring the image-level annotation information corresponding to the training image; inputting the training image, candidate frames and annotation information into a recognition model for training, the recognition model comprising a candidate frame cluster learning module and a self-attention mechanism module, the latter comprising an encoder module and a decoder module; and inputting an image of the target to be identified and its candidate frames into the trained recognition model, which outputs a recognition result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence, the recognition result comprising the position, size and category of the corresponding targets in the image. Because the recognition model models the relations between candidate frames through a self-attention mechanism and splices the position and size information of each candidate frame with its original features, richer information can be obtained, and a better detection result is therefore achieved.
Drawings
FIG. 1 is a schematic diagram of a prior art image containing a manual annotation;
Fig. 2 is a flowchart of a remote sensing image weak supervision target detection method based on a self-attention mechanism according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a basic identification model according to an embodiment of the present invention.
Detailed Description
The invention is further described below in connection with specific embodiments, in order to make the technical means, creative features, objectives and effects of the invention easy to understand.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", "the other end", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific direction, be configured and operated in the specific direction, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "mounted", "provided", "connected" and the like are to be construed broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or internal communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The remote sensing image weakly supervised target detection method based on the self-attention mechanism provided by the embodiment of the invention, as shown in fig. 2, comprises the following steps:
S10, acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image.
In this embodiment, taking the identification of aircraft carrier and amphibious assault ship targets as an example, an image in which targets need to be identified is first acquired and input into the recognition model, and the output recognition result is then obtained. The recognition model comprises a candidate frame cluster learning module and a self-attention mechanism module, the latter comprising an encoder module and a decoder module. The recognition result may be that no aircraft carrier and/or amphibious assault ship target exists in the image, or it may be the position, size and category of each identified aircraft carrier and/or amphibious assault ship.
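For illustration, a minimal inference sketch of this flow in PyTorch, assuming a trained model object that takes an image and its candidate frames and returns per-query category logits and boxes; all names and the interface are hypothetical, not the patent's actual code.

```python
import torch

def detect(model, image, proposals):
    # image: (3, H, W) tensor; proposals: (m, 4) candidate frames from an
    # external extraction algorithm such as Selective Search
    model.eval()
    with torch.no_grad():
        cls_logits, boxes = model(image.unsqueeze(0), proposals)  # (n, k+1), (n, 4)
    labels = cls_logits.argmax(dim=-1)
    keep = labels != 0                 # category 0 is the background category
    return boxes[keep], labels[keep]   # positions/sizes and categories of targets
```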
The recognition model in step S20 is based on the existing weakly supervised Proposal Cluster Learning algorithm (Proposal Cluster Learning for Weakly Supervised Object Detection, PCL); it models the relations between candidate frames by adding an extra self-attention encoder-decoder branch and uses the size and position information of the candidate frames as additional prior knowledge, thereby improving the detection result.
The overall flow is as shown in fig. 3:
1. Encoder module section:
In fig. 3, the upper half is the framework of the original PCL algorithm, and the lower half is the self-attention mechanism module, which constitutes the improvement branch. The input of the improvement branch is F ∈ R^{d×1} and its position-and-size code P ∈ R^{d×1}, where R denotes the real space; F is obtained by reducing, through a linear layer, the high-dimensional feature extracted for each candidate frame by the CNN and SPP layers. The input feature of the coding layer for each candidate frame is:
F* = F + P ∈ R^{d×1}
where d is the dimension of the reduced feature; in this embodiment, d is set to 128. Compared with the 4096 dimensions before reduction, computing with the reduced features greatly decreases the amount of computation and saves computing time. For an input image, if the number of candidate frames obtained by the candidate frame extraction algorithm is m, the feature map input to the encoder layer is M ∈ R^{d×m}.
Inputting M into the encoder layer encodes and learns the relation information between the high-dimensional features F of the candidate frames, yielding candidate frame features endowed with inter-frame relation semantics and position-size information. The encoder layer is structured as a self-attention mechanism layer, which is widely used in the field of computer vision: it first maps M into Q, K and V through different linear mapping layers and then obtains a new self-attention feature map (in the standard scaled dot-product form):
M_new = V · softmax(KᵀQ / √d) ∈ R^{d×m}
Passing through three such self-attention mechanism layers in succession yields the encoded candidate frame feature map M* ∈ R^{d×m}. A code sketch of this branch follows.
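The following PyTorch sketch shows one way such an encoder branch could be realized under the shapes described above (4096-dimensional per-frame features reduced to d = 128, a learned position-size encoding, three attention layers). The module name, the head count and the use of standard Transformer blocks (self-attention plus feed-forward) are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ProposalEncoder(nn.Module):
    """Hypothetical sketch of the self-attention encoder branch."""
    def __init__(self, in_dim=4096, d=128, num_layers=3, num_heads=8):
        super().__init__()
        self.reduce = nn.Linear(in_dim, d)   # linear dimension reduction, 4096 -> 128
        self.pos_proj = nn.Linear(4, d)      # position-size code P from (cx, cy, w, h)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feats, boxes):
        # feats: (m, 4096) per-candidate-frame CNN+SPP features
        # boxes: (m, 4) normalized candidate-frame geometry
        f = self.reduce(feats)               # F, one d-dim feature per candidate frame
        p = self.pos_proj(boxes)             # P, learned position-size encoding
        m_in = (f + p).unsqueeze(0)          # F* = F + P, batched as (1, m, d)
        return self.layers(m_in).squeeze(0)  # M*, encoded features, (m, d)
```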
2. Decoder module portion:
In fig. 3, the input of the decoder is a decoding feature map H ∈ R^{d×n}, formed by concatenating along the second dimension a group of learned query vectors q ∈ R^{d×1} (object queries), together with the encoded candidate frame feature map M* obtained by the encoder, where n is the number of query vectors. The decoder is similar in structure to the encoder, except that its input Q′ is linearly mapped from H while K′ and V′ are linearly mapped from M*. The H decoding feature map matrix is updated through a self-attention mechanism layer:
H_new = V′ · softmax(K′ᵀQ′ / √d) ∈ R^{d×n}
The decoded query vector feature map H* ∈ R^{d×n} is likewise obtained by passing through three self-attention mechanism layers in succession. Each query vector q can be regarded as a dynamic anchor frame that finds possible target positions in the image based on the information given by the encoder.
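A matching sketch of the decoder branch, with n learned query vectors cross-attending to the encoded candidate frame features M*, in the spirit of DETR-style object queries; the number of queries and the other hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Hypothetical sketch of the query-vector decoder branch."""
    def __init__(self, d=128, n_queries=100, num_layers=3, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d))  # learned object queries q
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, m_star):
        # m_star: (m, d) encoded candidate frame features M* from the encoder
        h = self.queries.unsqueeze(0)        # H, the query feature map, (1, n, d)
        # inside each layer, Q' comes from H while K' and V' come from M*
        return self.layers(h, m_star.unsqueeze(0)).squeeze(0)   # H*, (n, d)
```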
3. Prediction part:
The position, size and category of the targets in the image can be predicted using the decoded query vector set H*. H* is passed through a linear layer to obtain the category result cls ∈ R^{n×1} of the query vectors, in which category 0 is set as the background category and categories 1 to k are the target categories of interest. Meanwhile, H* is passed through a multi-layer perceptron to obtain the position-size result obj ∈ R^{n×4} of the query vectors, where the position-size result of the i-th query vector, obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^{1×4}, gives the center-point coordinates of the prediction frame on the x and y axes and its length and width. Predicting with the query vectors converts the multi-instance classification problem into a set prediction problem over the targets, so the prediction result is no longer limited to the positions given by the candidate frames but attends to the global information of the whole image; the feature extraction network can thereby obtain more comprehensive information, improving its feature extraction capability.
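The two prediction heads can then be sketched as follows: a single linear layer for the category scores (category 0 reserved for background) and a small multi-layer perceptron for the normalized (cx, cy, w, h) box parameters. Layer widths and the sigmoid normalization are assumptions.

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Hypothetical sketch of the category and position-size heads."""
    def __init__(self, d=128, k=4):          # k target categories of interest
        super().__init__()
        self.cls_head = nn.Linear(d, k + 1)  # categories 0..k, with 0 = background
        self.box_head = nn.Sequential(       # multi-layer perceptron for obj
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 4), nn.Sigmoid(),   # normalized (cx, cy, w, h)
        )

    def forward(self, h_star):
        # h_star: (n, d) decoded query feature map H*
        return self.cls_head(h_star), self.box_head(h_star)  # cls: (n, k+1), obj: (n, 4)
```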
4. Loss function:
The loss function used by the self-attention mechanism module part measures the difference between the predicted result set and the pseudo-label truth results and detection frame information generated by the PCL part; the candidate frame position information uses an L1 loss function, and the category information uses a focal loss function:
L_bbox = λ_L1 ||b_pred − b_truth||_1
The two loss functions are combined to obtain the loss function of this part:
L_trans = λ_1 L_bbox + λ_2 L_cls
where λ_1 and λ_2 are coefficients, set to 0.4 and 1 respectively;
b_pred denotes the predicted size and position information of the detection frame, including its length, width and center-point position in the image, and b_truth denotes the size and position information of the detection frame of the pseudo label generated by the algorithm;
L_cls denotes the focal loss function used for the category information, whose standard form is L_cls = −α(1 − p_t)^γ log(p_t) with p_t the predicted score c_pred assigned to the true category c_truth; α and γ are weight factors controlling the shape of the loss curve, c_pred denotes the category score of the predicted detection frame, and c_truth denotes the category information of the detection frame of the pseudo label generated by the algorithm.
With this loss function, the self-attention mechanism module obtains the ability to predict target categories. Through its feedback, the feature extraction part of the whole framework can be trained jointly with the PCL part, so that the association information between candidate frames and their size and position information improve the detection performance of the algorithm and finally achieve accurate identification of targets. A minimal sketch of this branch loss is given below.
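This sketch assumes each query has already been matched to a pseudo-label detection frame (the matching step is omitted); the focal-loss form is the standard one, the α and γ defaults are the common choices from the focal-loss literature rather than values from the patent, and λ_1 = 0.4, λ_2 = 1 follow the values stated above.

```python
import torch
import torch.nn.functional as F

def branch_loss(cls_logits, boxes_pred, labels, boxes_truth,
                lam1=0.4, lam2=1.0, alpha=0.25, gamma=2.0):
    # cls_logits: (n, k+1) query category scores; labels: (n,) long, pseudo-label classes
    # boxes_pred, boxes_truth: (n, 4) predicted and pseudo-label (cx, cy, w, h)
    l_bbox = F.l1_loss(boxes_pred, boxes_truth, reduction="mean")          # L1 box loss
    p_t = cls_logits.softmax(dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    l_cls = (-alpha * (1.0 - p_t).pow(gamma) * p_t.clamp_min(1e-8).log()).mean()
    return lam1 * l_bbox + lam2 * l_cls   # L_trans = lambda1*L_bbox + lambda2*L_cls
```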
For example: and carrying out target detection on the remote sensing image by adopting a target detection method of the weak supervision remote sensing image based on the self-attention mechanism module. In the experimental section, the disclosed Worldview remote sensing ship image dataset was used. 14252 remote sensing ship images are obtained by utilizing the data set segmentation, the sizes of the remote sensing ship images are 1024 multiplied by 1024, the remote sensing ship images totally comprise 4 kinds of 56539 target examples, and the categories are as follows: aircraft carriers, amphibious offensive carriers, repellents and other vessels. The evaluation indexes used in the experiment are mAP and Corloc, wherein mAP is tested on 1650 test sets, and the higher the value is, the better the detection result is; corLoc is tested on 12602 training sets, with higher values representing better training results for the algorithm.
Table 1. mAP comparison of the method of the invention (Ours) with the baseline method PCL

Method            mAP@0.1   mAP@0.3   mAP@0.5
PCL (baseline)    68.04     37.48     11.10
Ours              70.14     38.70     11.55
Table 2. CorLoc comparison of the method of the invention with the baseline method PCL
As the CorLoc results on the training set show quantitatively, the proposed method finds more accurate candidate frames during training and thus improves the training of the model. Because the model uses more accurate candidate frames during training, its detection index mAP is better than the baseline algorithm PCL under different IoU thresholds, showing that the proposed method yields more accurate detection frames on the test set. The results in the two tables demonstrate the effectiveness of the proposed method and the application value of weakly supervised target detection algorithms in the remote sensing field.
According to the remote sensing image weakly supervised target detection method based on the self-attention mechanism described above, images to be identified are detected by the recognition model. The recognition model breaks through the limitation of existing algorithms that judge each candidate frame in isolation, and improves the detection result by extracting the relation information between candidate frames with the self-attention mechanism module. Adding the size and position of each candidate frame as extra information improves the accuracy of candidate frame selection. Further, the constructed encoder-decoder predicts with query vectors, converting the multi-instance classification problem into a set prediction problem over the targets and using the information of the whole image to improve the training of the feature extraction network, which facilitates accurate identification of targets.
Based on the same inventive concept, the embodiment of the invention also provides a remote sensing image weak supervision target detection system based on a self-attention mechanism, and the principle of the system for solving the problem is similar to that of a remote sensing image weak supervision target detection method based on the self-attention mechanism, so that the implementation of the system can be referred to the implementation of the method, and the repetition is omitted.
The remote sensing image weak supervision target detection system based on the self-attention mechanism provided by the embodiment of the invention comprises the following components:
The acquisition module is used for acquiring the training image, its candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image.
The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (5)

1. The remote sensing image weak supervision target detection method based on the self-attention mechanism is characterized by comprising the following steps of:
S10, acquiring a training image and its candidate frames, and obtaining the image-level annotation information corresponding to the training image;
S20, inputting training images, candidate frames and labeling information into an identification model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
S30, inputting the image of the target to be identified and the candidate frame into a trained identification model, and sequentially passing through the candidate frame cluster learning module, the encoder module and the decoder module to output an identification result; the identification result comprises: the corresponding target position, size and category in the image;
In the step S30, the encoding flow of the encoder module is as follows:
extracting, for the target candidate frames output by the candidate frame cluster learning module, high-dimensional features F ∈ R^{d×1} and corresponding position-and-size codes P ∈ R^{d×1}, wherein R denotes the real space;
generating a feature map M ∈ R^{d×m} according to the dimension d of the high-dimensional features and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map:
M_new = V · softmax(KᵀQ / √d) ∈ R^{d×m};
the new self-attention feature map M_new passes through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^{d×m};
in the step S30, the decoding flow of the decoder module is as follows:
the input of the decoder module is a feature map H ∈ R^{d×n} formed by a group of learned query vectors q ∈ R^{d×1}, together with the encoded candidate frame feature map M* obtained by the encoder module, wherein n is the number of query vectors;
mapping the input of the decoder module into Q′, K′ and V′ through different linear mapping layers to obtain the update of the H matrix:
H_new = V′ · softmax(K′ᵀQ′ / √d) ∈ R^{d×n};
the updated H_new passes through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^{d×n}.
2. The method for detecting a weakly supervised target of a remote sensing image based on a self attention mechanism as set forth in claim 1, wherein in the step S30, outputting the recognition result includes:
And predicting the position, size and category of the target in the image by using the decoded query vector set H*, and outputting a prediction result.
3. The method for detecting a weakly supervised target of a remote sensing image based on a self-attention mechanism as set forth in claim 2, wherein predicting the position, size and category of the target in the image by using the decoded query vector set H* and outputting a prediction result comprises the following steps:
H* obtains a category result cls ∈ R^{n×1} of the query vectors through a linear layer, wherein category 0 in the category result is set as the background category, and categories 1 to k are the target categories of interest;
H* passes through a multi-layer perceptron to obtain the position-size result obj ∈ R^{n×4} of the query vectors, wherein the position-size result of the i-th query vector, obj_i = [cx_i, cy_i, w_i, h_i] ∈ R^{1×4}, gives the center-point coordinates of the prediction frame on the x and y axes and its length and width.
4. The method of claim 3, wherein in the step S20, the loss function used by the self-attention mechanism module part is:
L_trans = λ_1 L_bbox + λ_2 L_cls
wherein λ_1 and λ_2 are coefficients;
L_bbox = λ_L1 ||b_pred − b_truth||_1
L_bbox denotes the L1 loss function used for the candidate frame position information, b_pred denotes the predicted size and position information of the detection frame, including its length, width and center-point position in the image, and b_truth denotes the size and position information of the detection frame of the pseudo label generated by the algorithm;
L_cls denotes the focal loss function used for the category information, wherein α and γ are weight factors controlling the shape of the loss curve, c_pred denotes the category score of the predicted detection frame, and c_truth denotes the category information of the detection frame of the pseudo label generated by the algorithm.
5. The remote sensing image weak supervision target detection system based on the self-attention mechanism is characterized by comprising the following components:
The acquisition module is used for acquiring the training image, its candidate frames and the image-level annotation information corresponding to the training image;
The training module is used for inputting training images, candidate frames and labeling information into the recognition model for training; the identification model comprises: a candidate frame cluster learning module and a self-attention mechanism module; the self-attention mechanism module includes: an encoder module and a decoder module;
The detection module is used for inputting the image of the target to be identified and the candidate frame into the trained identification model, and outputting an identification result through the candidate frame cluster learning module, the encoder module and the decoder module in sequence; the identification result comprises: the corresponding target position, size and category in the image;
The encoding flow of the encoder module is as follows:
extracting, for the target candidate frames output by the candidate frame cluster learning module, high-dimensional features F ∈ R^{d×1} and corresponding position-and-size codes P ∈ R^{d×1}, wherein R denotes the real space;
generating a feature map M ∈ R^{d×m} according to the dimension d of the high-dimensional features and the number m of target candidate frames;
mapping the feature map M into Q, K and V through different linear mapping layers to obtain a new self-attention feature map:
M_new = V · softmax(KᵀQ / √d) ∈ R^{d×m};
the new self-attention feature map M_new passes through three self-attention mechanism layers to obtain the encoded candidate frame feature map M* ∈ R^{d×m};
the decoding flow of the decoder module is as follows:
the input of the decoder module is a feature map H ∈ R^{d×n} formed by a group of learned query vectors q ∈ R^{d×1}, together with the encoded candidate frame feature map M* obtained by the encoder module, wherein n is the number of query vectors;
mapping the input of the decoder module into Q′, K′ and V′ through different linear mapping layers to obtain the update of the H matrix:
H_new = V′ · softmax(K′ᵀQ′ / √d) ∈ R^{d×n};
the updated H_new passes through three self-attention mechanism layers to obtain the decoded query vector feature map H* ∈ R^{d×n}.
CN202210524417.2A 2022-05-13 2022-05-13 Remote sensing image weak supervision target detection method and system based on self-attention mechanism Active CN114821331B (en)

Priority Applications (1)

Application Number    Priority Date   Filing Date   Title
CN202210524417.2A     2022-05-13      2022-05-13    Remote sensing image weak supervision target detection method and system based on self-attention mechanism

Applications Claiming Priority (1)

Application Number    Priority Date   Filing Date   Title
CN202210524417.2A     2022-05-13      2022-05-13    Remote sensing image weak supervision target detection method and system based on self-attention mechanism

Publications (2)

Publication Number   Publication Date
CN114821331A (en)    2022-07-29
CN114821331B (en)    2024-11-05

Family

ID=82514398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210524417.2A Active CN114821331B (en) 2022-05-13 2022-05-13 Remote sensing image weak supervision target detection method and system based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN114821331B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN113378829A (en) * 2020-12-15 2021-09-10 浙江大学 Weak supervision target detection method based on positive and negative sample balance

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113902926B (en) * 2021-12-06 2022-05-31 之江实验室 General image target detection method and device based on self-attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN113378829A (en) * 2020-12-15 2021-09-10 浙江大学 Weak supervision target detection method based on positive and negative sample balance

Also Published As

Publication number Publication date
CN114821331A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN107885764B (en) Rapid Hash vehicle retrieval method based on multitask deep learning
CN112084331B (en) Text processing and model training method and device, computer equipment and storage medium
CN107679250B (en) Multi-task layered image retrieval method based on deep self-coding convolutional neural network
CN108108657A (en) A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning
CN110909673A (en) Pedestrian re-identification method based on natural language description
CN117011883A (en) Pedestrian re-recognition method based on pyramid convolution and transducer double branches
CN115131313A (en) Hyperspectral image change detection method and device based on Transformer
CN114913546A (en) Method and system for detecting character interaction relationship
CN115909280A (en) Traffic sign recognition algorithm based on multi-head attention mechanism
CN116580243A (en) Cross-domain remote sensing scene classification method for mask image modeling guide domain adaptation
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
CN118314353B (en) Remote sensing image segmentation method based on double-branch multi-scale feature fusion
CN114821331B (en) Remote sensing image weak supervision target detection method and system based on self-attention mechanism
CN117893737A (en) Jellyfish identification and classification method based on YOLOv-LED
CN117829243A (en) Model training method, target detection device, electronic equipment and medium
CN117152504A (en) Space correlation guided prototype distillation small sample classification method
CN116524543A (en) Multi-mode unsupervised pedestrian re-identification method, device, equipment and storage medium
CN116524258A (en) Landslide detection method and system based on multi-label classification
CN115424275A (en) Fishing boat brand identification method and system based on deep learning technology
CN116992947A (en) Model training method, video query method and device
CN115934966A (en) Automatic labeling method based on remote sensing image recommendation information
CN115035455A (en) Cross-category video time positioning method, system and storage medium based on multi-modal domain resisting self-adaptation
CN114168780A (en) Multimodal data processing method, electronic device, and storage medium
Liu et al. L2-LiteSeg: A Real-Time Semantic Segmentation Method for End-to-End Autonomous Driving
CN118410200B (en) Remote sensing image retrieval method and device, electronic equipment and storage medium

Legal Events

Code   Description
PB01   Publication
SE01   Entry into force of request for substantive examination
GR01   Patent grant