
CN115131551A - Target feature extraction method based on cross-correlation self-attention mechanism - Google Patents

Target feature extraction method based on cross-correlation self-attention mechanism

Info

Publication number
CN115131551A
CN115131551A
Authority
CN
China
Prior art keywords
matrix
attention mechanism
channel
cross
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210778826.5A
Other languages
Chinese (zh)
Inventor
袁帅
许景科
栾方军
张笑闻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Jianzhu University
Original Assignee
Shenyang Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Jianzhu University filed Critical Shenyang Jianzhu University
Priority to CN202210778826.5A priority Critical patent/CN115131551A/en
Publication of CN115131551A publication Critical patent/CN115131551A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of target detection and identification, and discloses a target feature extraction method based on a cross-correlation self-attention mechanism. The method first takes as input a feature map X of size H × W × C; a window-division operation is then applied to the feature map; a linear layer expands the channel dimension to 2 × C, and the result is split along the channel dimension into a matrix M and a matrix V; a cross-correlation matrix is obtained and an activation operation applied; self-attention and channel attention calculations follow; finally, a feature map Y of size H × W × C is output. The invention searches for correlations among the elements of the feature map to obtain features similar to the target, while sharing information among channels, thereby selecting attention regions in both the spatial and channel dimensions. The method improves the model's recognition of the information to be detected in an image and raises its recognition precision.

Description

Target feature extraction method based on cross-correlation self-attention mechanism
Technical Field
The invention relates to the technical field of target detection and identification, in particular to a target feature extraction method based on a cross-correlation self-attention mechanism.
Background
With the development of deep learning, image processing methods based on convolutional neural networks are becoming mainstream. In deep learning, as the amount of data grows rapidly, how to focus limited computing power on a target area with an attention mechanism has become a current research hotspot. Many studies integrate attention mechanisms into feature extraction: first, a convolutional neural network can use an attention mechanism with learnable weights to automatically determine which feature regions to highlight; second, attention imitates human visual behavior by finding the focal regions of an image.
Attention mechanisms fall into three categories: spatial attention, channel attention, and self-attention. Hou et al. proposed the Coordinate Attention (CA) mechanism, which processes the feature map in the spatial dimensions, pooling along each of the two directions separately, and can capture both long-range dependencies and the precise location of information. Wang et al. first pool the feature map and then apply a one-dimensional convolution over its channel dimension to obtain the interrelations between channels. To combine spatial attention with channel attention, Woo et al. proposed the CBAM module, which first assigns weights in the channel dimension and then searches for the target in the spatial dimension. The Transformer model first applied the self-attention mechanism in natural language processing, and the ViT model later extended it to computer vision. Self-attention can locate a target through the connections between image pixels and can capture global information in one pass. It follows that convolution-based spatial and channel attention mechanisms lack self-attention's ability to capture global information, while self-attention by itself provides no information interaction between channels. Based on the strengths and weaknesses of existing attention mechanisms, the invention provides a target feature extraction method based on a cross-correlation self-attention mechanism.
Disclosure of Invention
The invention aims to provide a target feature extraction method based on a cross-correlation self-attention mechanism. The method uses the global-information-capturing ability of the conventional self-attention mechanism, realizes information interaction among channels through a channel attention mechanism, and on this basis uses a cross-correlation matrix to find similar information in the feature map and thus locate the attention region in the image, thereby achieving efficient and accurate target identification.
The invention is realized as follows: a target feature extraction method based on a cross-correlation self-attention mechanism, executed according to the following steps.
S1: first, input a feature map X of size H × W × C.
S2: perform the window-division operation on the feature map: divide the input tensor into groups of n pixels along the height and width directions, so that each window has size n × n; finally, expand the three-dimensional tensor within each window from H × W × C to HW × 1 × C (a sketch of this step follows).
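As an illustration of step S2, below is a minimal PyTorch sketch of the window-division operation; the function name, the omitted batch dimension, and the assumption that H and W are divisible by n are ours, not the patent's:

```python
import torch

def window_partition(x: torch.Tensor, n: int) -> torch.Tensor:
    """Split an H x W x C feature map into non-overlapping n x n windows and
    flatten, taking the tensor from H x W x C to HW x 1 x C (step S2)."""
    H, W, C = x.shape
    x = x.view(H // n, n, W // n, n, C)        # group rows/cols into windows
    x = x.permute(0, 2, 1, 3, 4).contiguous()  # (H/n, W/n, n, n, C)
    return x.reshape(-1, 1, C)                 # all window pixels: (HW, 1, C)
```

Reshaping after the permute keeps the pixels of each window contiguous, which is what lets the later cross-correlation operate within windows.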
S3: expand the channel dimension to 2 × C through a linear layer, and split the matrix along the channel dimension into a matrix M and a matrix V, where each column of M and of V is given by formula (1) and formula (2):
[Formula (1), the definition of column i of matrix M, is rendered as an image in the original.]
[Formula (2), the definition of column i of matrix V, is rendered as an image in the original.]
where i ∈ [0, C] and C represents the number of channels of the matrix. A sketch of this step follows.
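A minimal PyTorch sketch of step S3; the channel count and tensor sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

C = 64                              # illustrative channel count
expand = nn.Linear(C, 2 * C)        # linear layer doubles the channel dimension

x = torch.randn(16 * 16, 1, C)      # flattened feature map, HW x 1 x C
m, v = expand(x).chunk(2, dim=-1)   # split along channels: M and V, each HW x 1 x C
```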
S4: obtain the cross-correlation matrix, proceeding as follows (a sketch of this step follows the sub-steps).
S4.1: copy each column vector of the M matrix into H × W columns; duplicate the resulting matrix of size (H·W) × (H·W), transpose one copy, and subtract the transposed copy from the other to obtain the difference between each element of the M matrix and every other element; this matrix is defined as M_dis.
S4.2: add the elements at the corresponding positions of each channel of the M_dis matrix to obtain the numerator matrix.
[The symbol for the numerator matrix is rendered as an image in the original.]
S4.3: define the denominator matrix as
[symbol rendered as an image in the original],
with the expression given by formula (3):
[Formula (3) is rendered as an image in the original.]
S4.4: divide the numerator matrix by the denominator matrix to obtain the similarity matrix M_mask, calculated as in formula (4):
[Formula (4) is rendered as an image in the original.]
S4.5: convolve the matrix M_mask with a convolution kernel of size 1 × 1.
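The PyTorch sketch below illustrates step S4. The pairwise-difference construction (S4.1-S4.2) follows the text; since formulas (3) and (4) appear only as images in the original, the denominator and the resulting normalization here are illustrative assumptions, not the patented expressions:

```python
import torch
import torch.nn as nn

def cross_correlation_mask(m: torch.Tensor, conv1x1: nn.Conv2d) -> torch.Tensor:
    """m: (HW, C) matrix from step S3. Returns a convolved similarity mask."""
    # S4.1: difference between every element of M and every other element,
    # per channel, via broadcasting instead of explicit copy-and-transpose
    m_dis = m.unsqueeze(1) - m.unsqueeze(0)              # (HW, HW, C)
    # S4.2: add corresponding positions over channels -> numerator matrix
    numerator = m_dis.sum(dim=-1)                        # (HW, HW)
    # S4.3-S4.4: assumed denominator; divide to get the similarity matrix
    denominator = m_dis.abs().sum(dim=-1).clamp(min=1e-6)
    m_mask = numerator / denominator                     # (HW, HW)
    # S4.5: 1 x 1 convolution over the mask treated as a one-channel map
    return conv1x1(m_mask.unsqueeze(0).unsqueeze(0)).squeeze(0).squeeze(0)

# usage: conv = nn.Conv2d(1, 1, kernel_size=1); cross_correlation_mask(torch.randn(64, 8), conv)
```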
S5: activation operation: activate the convolved tensor with an activation function, defined as formula (5):
M′ = softmax(1 − Sigmoid(X))    formula (5)
where X is the input tensor and M′ is the matrix after the activation function.
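Formula (5) translates directly into code; a sketch assuming the softmax runs over the last dimension of the mask:

```python
import torch

def attention_activation(x: torch.Tensor) -> torch.Tensor:
    """Formula (5): M' = softmax(1 - Sigmoid(X))."""
    return torch.softmax(1.0 - torch.sigmoid(x), dim=-1)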
S6: perform the self-attention calculation: multiply the activated matrix M′ by the matrix V; rearrange the resulting tensor into H × W × C; then multiply the rearranged result, channel by channel, by the channel weights obtained from the channel attention mechanism.
S7: perform the channel attention calculation.
S8: output a feature map Y of size H × W × C.
Further, a channel attention mechanism is added: first, average-pool the input feature map of size H × W × C to obtain a feature map of size 1 × 1 × C; apply a convolution with a one-dimensional kernel of size 3 × 1 to this feature map; then activate the convolved feature values with a Sigmoid function, implemented as formula (6):
E(X) = Sigmoid(C3×1(AvgPool(X)))    formula (6)
where X denotes the input feature and C3×1 denotes the convolution operation with kernel size 3 × 1.
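A minimal sketch of this channel attention branch (formula (6)), assuming PyTorch and NCHW layout; the module name is ours:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """E(X) = Sigmoid(C3x1(AvgPool(X))): global average pooling to one value
    per channel, a 1-D convolution of kernel size 3 across the channel axis,
    then a Sigmoid, yielding one attention weight per channel."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                    # average pool -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # 1-D conv across channels
        return torch.sigmoid(y)                   # channel weights, (B, C)
```

In step S6 these weights multiply the rearranged self-attention output channel by channel.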
Further, the network model used is YOLOv5, operated as follows:
S8.1: acquire a data set and apply Mosaic data enhancement to it; feed the enhanced data into the network for training.
S8.2: apply the target feature extraction method based on the cross-correlation self-attention mechanism to the YOLOv5 network, replacing the last three C3 modules in the neck structure as follows:
S8.2.1: copy the input tensor into two parts and process them through two branches;
S8.2.2: pass one branch through a 1 × 1 convolution and the modified self-attention mechanism; pass the other branch through a 1 × 1 convolution;
S8.2.3: concatenate the outputs of the two branches along the channel dimension and apply a 1 × 1 convolution.
S9: the optimization algorithm uses stochastic gradient descent (SGD) as the optimizer, with 16 pictures per training batch, an initial learning rate of 1e-2, a weight decay of 5e-4, and a momentum of 0.937, training for 300 epochs; 3 epochs of warm-up training are used at the start (see the optimizer sketch below).
S10: after the model is trained, predict on a picture to obtain the result.
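The training configuration of step S9 maps onto a PyTorch optimizer as follows; the model object is a stand-in, and the warm-up schedule itself is not detailed in the text:

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # stand-in for the YOLOv5 network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,            # initial learning rate
    momentum=0.937,     # momentum
    weight_decay=5e-4,  # weight attenuation parameter
)
EPOCHS, WARMUP_EPOCHS, BATCH_SIZE = 300, 3, 16
```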
Further, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a master controller, implements a method as claimed in any one of the above.
Compared with the prior art, the invention has the beneficial effects that:
The method searches for correlations among the elements of the feature map, obtains features similar to the target, and shares information among channels, thereby selecting attention regions in both the spatial and channel dimensions. This improves the model's recognition of the information to be detected in an image and raises its recognition precision.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered limiting of scope; those skilled in the art can obtain other related drawings from them without inventive effort.
FIG. 1 is a flow chart of a model of the present invention.
Fig. 2 is a block diagram of the operation of the present invention in the YOLOv5 network.
FIG. 3 is a block diagram of the cross-correlation self-attention mechanism of the present invention.
FIG. 4 is a flow chart of the cross-correlation self-attention mechanism of the present invention.
Fig. 5 is a block diagram of the channel attention operation of the present invention.
FIG. 6 is a schematic diagram of the numerator matrix in the cross-correlation self-attention mechanism of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention more apparent, the technical solutions are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention; all other embodiments obtained by a person skilled in the art without inventive effort on the basis of these embodiments fall within the scope of protection of the present invention. The following detailed description, presented in the figures, is therefore not intended to limit the scope of the invention as claimed, but merely represents selected embodiments.
Referring to FIGS. 1-6, in the target feature extraction method based on the cross-correlation self-attention mechanism, the network model is YOLOv5, and the YOLOv5 modules used in the invention include the following (a sketch of the Focus module follows this list):
1. Focus module: expands the three channels of an RGB image fourfold into twelve channels, then applies a convolution to obtain a 2x-downsampled feature map;
2. Conv module: comprises a two-dimensional convolution, batch normalization, and an activation function;
3. SPP module: applies max-pooling operations of 5 × 5, 9 × 9, and 13 × 13 to the input respectively and obtains fused features through convolution;
4. C3 module: comprises three Conv modules and a Bottleneck structure; one branch passes through a Conv module and the Bottleneck module, the other through a Conv module, and the results of the two branches undergo a Concat operation before passing through a final Conv module;
5. Cross-correlation C3 module: replaces the Bottleneck structure of the C3 module with the cross-correlation self-attention mechanism.
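A sketch of the Focus module above, assuming PyTorch; the output width and kernel size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Samples every other pixel in each spatial direction, turning the three
    RGB channels into twelve, then convolves to get a 2x-downsampled map."""
    def __init__(self, c_out: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(12, c_out, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 3, H, W)
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)                                # (B, c_out, H/2, W/2)
```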
In this embodiment, after combining the invention with the YOLOv5 network, the operation steps are as follows:
S1: acquire a data set, divide the data, and apply Mosaic data enhancement to the data set.
S2: build the YOLOv5 network model.
S3: apply the target feature extraction method based on the cross-correlation self-attention mechanism to the YOLOv5 network, replacing the last three C3 modules in the neck structure as follows (see the sketch after these steps):
S3.1: copy the input tensor into two parts and process them through two branches.
S3.2: pass one branch through a 1 × 1 convolution and the modified self-attention mechanism.
S3.3: pass the other branch through a 1 × 1 convolution.
S3.4: concatenate the outputs of the two branches along the channel dimension and apply a 1 × 1 convolution.
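A sketch of the replacement block of steps S3.1-S3.4, assuming PyTorch; `attention` stands in for the cross-correlation self-attention block and must preserve the tensor shape:

```python
import torch
import torch.nn as nn

class CrossCorrelationC3(nn.Module):
    """Two branches: 1x1 conv + modified self-attention, and 1x1 conv alone;
    outputs are concatenated along channels and fused by a final 1x1 conv."""
    def __init__(self, c: int, attention: nn.Module):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(c, c, 1), attention)
        self.branch2 = nn.Conv2d(c, c, 1)
        self.fuse = nn.Conv2d(2 * c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.branch1(x), self.branch2(x)], dim=1)  # concat
        return self.fuse(y)
```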
S4: feed the processed data set into the model for training, and evaluate the trained model on test images.
S5: the optimization algorithm uses stochastic gradient descent (SGD) as the optimizer, with 16 pictures per training batch, an initial learning rate of 1e-2, a weight decay of 5e-4, and a momentum of 0.937, training for 300 epochs; 3 epochs of warm-up training are used at the start. The invention is built with the PyTorch framework and runs on an Intel Xeon Gold 5320 CPU @ 2.20 GHz and an NVIDIA RTX A4000 GPU under Ubuntu 18.04.
S6: comparing the original YOLOv5 model with the YOLOv5 model augmented with the proposed method, the test results of the two networks are shown in Table 1:

TABLE 1 Comparison of results

Model             Precision (%)   Recall (%)   AP (%)
YOLOv5            78.6            71.2         73.3
Proposed method   79.2            76.9         77.1
The precision index measures, of all the targets predicted by the model, how many are real targets. The recall index measures, of all real targets, how many the model successfully predicted. Average precision (AP) balances the two: with recall as the abscissa and precision as the ordinate, it is the area under the resulting precision-recall (PR) curve.
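As a worked illustration of the AP definition above (the PR points are made up, not values from the paper):

```python
def average_precision(recall, precision):
    """Area under the precision-recall curve by trapezoidal integration;
    points are assumed sorted by increasing recall."""
    area = 0.0
    for i in range(1, len(recall)):
        area += (recall[i] - recall[i - 1]) * (precision[i] + precision[i - 1]) / 2.0
    return area

r = [0.0, 0.5, 0.75, 1.0]       # illustrative recall points
p = [1.0, 0.9, 0.8, 0.6]        # illustrative precision points
print(average_precision(r, p))  # 0.8625
```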
By completing the steps, efficient and accurate target identification can be realized, and the target prediction accuracy can be improved.
In this embodiment, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a host controller, implements the method of any one of the above.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A target feature extraction method based on a cross-correlation self-attention mechanism, characterized by comprising the following steps:
S1: first, input a feature map X of size H × W × C;
S2: perform a window-division operation on the feature map;
S3: expand the channel dimension to 2 × C through a linear layer, and split the matrix along the channel dimension into a matrix M and a matrix V;
S4: obtain the cross-correlation matrix;
S5: perform the activation operation;
S6: perform the self-attention calculation;
S7: perform the channel attention calculation;
S8: output a feature map Y of size H × W × C.
2. The target feature extraction method based on the cross-correlation self-attention mechanism according to claim 1, characterized in that step S2 comprises the following steps:
S2.1: divide the input tensor into groups of n pixels along the height and width directions, each window having size n × n;
S2.2: expand the three-dimensional tensor within each window from H × W × C to HW × 1 × C.
3. The target feature extraction method based on the cross-correlation self-attention mechanism according to claim 1, characterized in that step S3 comprises the following steps:
expand the channel dimension to 2 × C through a linear layer; split the matrix along the channel dimension into a matrix M and a matrix V, where each column of M and of V is given by formulas (1) and (2):
[Formula (1), the definition of column i of matrix M, is rendered as an image in the original.]
[Formula (2), the definition of column i of matrix V, is rendered as an image in the original.]
where i ∈ [0, C] and C represents the number of channels of the matrix.
4. The target feature extraction method based on the cross-correlation self-attention mechanism according to claim 1, characterized in that step S4 comprises the following steps:
S4.1: copy each column vector of the M matrix into H × W columns; duplicate the resulting matrix of size (H·W) × (H·W), transpose one copy, and subtract the transposed copy from the other to obtain the difference between each element of the M matrix and every other element; this matrix is defined as M_dis;
S4.2: add the elements at the corresponding positions of each channel of the M_dis matrix to obtain the numerator matrix
[symbol rendered as an image in the original];
S4.3: define the denominator matrix as
[symbol rendered as an image in the original],
with the expression given by formula (3):
[Formula (3) is rendered as an image in the original.]
S4.4: divide the numerator matrix by the denominator matrix to obtain the similarity matrix M_mask, calculated as in formula (4);
[Formula (4) is rendered as an image in the original.]
S4.5: convolve the matrix M_mask with a convolution kernel of size 1 × 1.
5. The target feature extraction method based on the cross-correlation self-attention mechanism according to claim 1, characterized in that step S5 is performed as follows: activate the convolved tensor with an activation function, defined as formula (5):
M′ = softmax(1 − Sigmoid(X))    formula (5)
where X is the input tensor and M′ is the matrix after the activation function.
6. The target feature extraction method based on the cross-correlation self-attention mechanism according to claim 1, characterized in that step S6 comprises the following steps: multiply the activated matrix M′ by the matrix V; rearrange the resulting tensor into H × W × C; then multiply the rearranged result, channel by channel, by the channel weights obtained from the channel attention mechanism.
7. The target feature extraction method based on the cross-correlation self-attention mechanism according to claim 1 or 6, characterized in that a channel attention mechanism is added: first, average-pool the input feature map of size H × W × C to obtain a feature map of size 1 × 1 × C;
apply a convolution with a one-dimensional kernel of size 3 × 1 to the feature map; activate the convolved feature values with a Sigmoid function, implemented as formula (6):
E(X) = Sigmoid(C3×1(AvgPool(X)))    formula (6)
where X denotes the input feature and C3×1 denotes the convolution operation with kernel size 3 × 1.
8. The target feature extraction method based on the cross-correlation self-attention mechanism according to claim 1, characterized in that the network model is YOLOv5, and the method comprises the following steps:
S8.1: acquire a data set and apply Mosaic data enhancement to it; feed the enhanced data into the network for training;
S8.2: apply the target feature extraction method based on the cross-correlation self-attention mechanism to the YOLOv5 network and replace the last three C3 modules in the neck structure, the replacement comprising:
S8.2.1: copy the input tensor into two parts and process them through two branches;
S8.2.2: pass one branch through a 1 × 1 convolution and the improved self-attention mechanism; pass the other branch through a 1 × 1 convolution;
S8.2.3: concatenate the outputs of the two branches along the channel dimension and apply a 1 × 1 convolution;
S8.3: the optimization algorithm uses stochastic gradient descent (SGD) as the optimizer, with 16 pictures per training batch, an initial learning rate of 1e-2, a weight decay of 5e-4, and a momentum of 0.937, training for 300 epochs, with 3 epochs of warm-up training at the start;
S8.4: after the model is trained, predict on a picture to obtain the result.
9. A computer-readable storage medium, on which a computer program is stored, which, when executed by a master controller, implements the method of any one of claims 1-8.
CN202210778826.5A 2022-06-30 2022-06-30 Target feature extraction method based on cross-correlation self-attention mechanism Pending CN115131551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210778826.5A CN115131551A (en) 2022-06-30 2022-06-30 Target feature extraction method based on cross-correlation self-attention mechanism


Publications (1)

Publication Number Publication Date
CN115131551A true CN115131551A (en) 2022-09-30

Family

ID=83380949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210778826.5A Pending CN115131551A (en) 2022-06-30 2022-06-30 Target feature extraction method based on cross-correlation self-attention mechanism

Country Status (1)

Country Link
CN (1) CN115131551A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937197A (en) * 2023-01-05 2023-04-07 哈尔滨市科佳通用机电股份有限公司 Method for detecting breaking fault of pull rod chain of manual brake
CN115937197B (en) * 2023-01-05 2023-09-08 哈尔滨市科佳通用机电股份有限公司 Method for detecting breaking fault of pull rod chain of manual brake


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination