CN114037839B - Small target identification method, system, electronic equipment and medium
- Publication number: CN114037839B (application CN202111225109.1A)
- Authority
- CN
- China
- Prior art keywords: feature map, target image, small target, feature
- Prior art date: 2021-10-21
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/241 — Pattern recognition; analysing; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/253 — Pattern recognition; analysing; fusion techniques of extracted features
- G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/048 — Neural networks; architecture; activation functions
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a small target identification method, system, electronic device and medium. Features of a target image are extracted multiple times to obtain several feature maps with different resolution, semantic and position characteristics, and the features in those maps that benefit small target identification are fused into a single feature map that performs well in resolution, semantics and position alike. This fused feature map improves the accuracy and performance of a convolutional neural network model in small target identification.
Description
Technical Field
The invention relates to the technical field of deep learning image processing, and in particular to a small target identification method, system, electronic device and medium.
Background
Object detection is a fundamental task in computer vision and pattern recognition: finding objects of interest in images and determining their category and location. However, small targets have long been a difficulty in visual tasks such as object detection and semantic segmentation, and the detection accuracy for small targets is usually only about half of that for large targets.
The lower detection accuracy for small targets is mainly due to the contradiction between resolution and semantic information. When extracting features from an image, semantic and position information must be sacrificed to obtain a high-resolution feature map, while resolution must be sacrificed to obtain a feature map with strong semantic and position information. A small target has a small outline and usually lies near the edge of the image, so its resolution, semantic information and position information are weak to begin with. If its semantic and position information are enhanced, its resolution weakens further and the convolutional neural network model struggles to identify it; if its resolution is enhanced, its semantic and position information weaken and the model struggles just the same.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. It therefore provides a small target identification method, system, electronic device and medium that can improve the performance of a convolutional neural network model in small target detection.
In a first aspect, an embodiment of the present invention provides a small target recognition method, including the following steps:
acquiring a target image input to a convolutional neural network model, wherein the target image contains a small target to be identified;
extracting features of the small target in the target image to obtain an original feature map of the target image;
extracting features of the original feature map through a channel attention mechanism to obtain a semantic feature map that emphasizes small target semantic features, extracting features of the original feature map through a spatial attention mechanism to obtain a position feature map that emphasizes small target position features, and fusing the original feature map, the semantic feature map and the position feature map to obtain a fused feature map of the target image;
classifying the small target in the target image based on the fused feature map to obtain the recognition result of the small target.
According to the embodiment of the invention, at least the following technical effects are achieved:
The method first performs an initial extraction of information from the target image; the resulting original feature map has high resolution but weak semantic and position information. A channel attention mechanism then extracts features from the original feature map to obtain a semantic feature map that emphasizes small target semantic features, and a spatial attention mechanism extracts features from the original feature map to obtain a position feature map that emphasizes small target position features. Fusing the original, semantic and position feature maps lets each compensate for the others' weaknesses, yielding a fused feature map enhanced in resolution, semantic information and position information. Classifying and identifying the small targets in the target image based on this fused feature map improves the recognition accuracy of the convolutional neural network model on small targets.
According to some embodiments of the present invention, extracting the features of the original feature map of the target image through the channel attention mechanism to obtain the semantic feature map emphasizing small target semantic features includes the steps of: performing global average pooling and max pooling on the original feature map to obtain an average feature matrix and a maximum feature matrix, and adding the two element-wise; convolving the element-wise sum with a 1×1×C/r convolution kernel and activating with a ReLU nonlinear activation layer; and convolving the ReLU-activated result with a 1×1×C convolution kernel and activating with a sigmoid nonlinear activation function to obtain the semantic feature map emphasizing small target semantic features.
According to some embodiments of the present invention, extracting the features of the original feature map of the target image through the spatial attention mechanism to obtain the position feature map emphasizing small target position features includes the steps of: performing global average pooling and max pooling on the original feature map to obtain an average feature matrix and a maximum feature matrix, and concatenating the two; and convolving the concatenated result and activating with a sigmoid nonlinear activation function to obtain the position feature map emphasizing small target position features.
According to some embodiments of the present invention, fusing the original feature map, the semantic feature map and the position feature map of the target image to obtain the fused feature map includes the steps of: multiplying the original feature map with the semantic feature map element-wise and doubling the product, multiplying the original feature map with the position feature map element-wise, and adding the two products element-wise; and multiplying the element-wise sum with the original feature map element-wise to obtain the fused feature map of the target image.
According to some embodiments of the invention, the features of the small target in the target image are extracted by convolution or dilated convolution.
According to some embodiments of the invention, the recognition result of the small target includes: marking the category of the small target with text, marking the position of the small target with a rectangular box, and masking small targets of different categories with different colors.
In a second aspect, an embodiment of the present invention provides a small target detection system, including:
The input module is used for acquiring a target image input to the convolutional neural network model, wherein the target image contains a small target to be identified;
The feature extraction module is used for extracting features of the small target in the target image to obtain an original feature map of the target image;
The feature fusion module is used for extracting features of the original feature map through a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, extracting features of the original feature map through a spatial attention mechanism to obtain a position feature map emphasizing small target position features, and fusing the original feature map, the semantic feature map and the position feature map to obtain a fused feature map of the target image;
The output module is used for classifying the small target in the target image based on the fused feature map to obtain the recognition result of the small target.
According to the embodiment of the invention, at least the following technical effects are achieved:
The feature extraction module performs an initial extraction of information from the target image; the resulting original feature map has high resolution but weak semantic and position information. The feature fusion module extracts features of the original feature map with a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, extracts features of the original feature map with a spatial attention mechanism to obtain a position feature map emphasizing small target position features, and fuses the original, semantic and position feature maps so that each compensates for the others' weaknesses, yielding a fused feature map enhanced in resolution, semantic information and position information. The convolutional neural network classifies and identifies small targets in the target image based on the fused feature map, improving the recognition accuracy of the model on small targets.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
the small object recognition method as claimed in the first aspect.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions for performing:
the small object recognition method as claimed in the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a small target recognition method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the structure of a SAFF module of the small object recognition method according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal structure of a channel attention branch of the small object recognition method according to the embodiment of the present invention;
FIG. 4 is a schematic structural diagram of Stage6 of the small target recognition method according to an embodiment of the present invention;
FIG. 5 is a diagram of a DSAFF-Net structure of a small target recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a small target detection system according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer device according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
With the continuing development of artificial intelligence research, deep learning has advanced rapidly in recent years and has become the most widely used technology in the field of computer vision, showing great advantages in image recognition, target detection, object tracking and other fields.
The target detection task is a basic task in computer vision and pattern recognition, and also one of its core tasks: finding objects of interest in an image and determining their category and location — in other words, answering the questions "where?" and "what?". It provides reliable input for subsequent studies such as target tracking, behavior recognition and scene understanding. Before deep learning emerged, traditional target detection generally took three steps. First, candidate regions are extracted with sliding windows of different sizes or by selective search. Second, the relevant visual features of each candidate region are extracted, such as the Haar features commonly used in face detection and the HOG features commonly used in pedestrian detection and general target detection. Finally, a trained classifier performs classification. Traditional target detection has many shortcomings, such as low detection speed, low accuracy, poor real-time performance and heavy computation.
With the development of deep learning in recent years, target detection algorithms have shifted from traditional hand-crafted features to detection based on deep neural networks. The convolutional neural network is an important deep learning method. LeNet-5, proposed by Yann LeCun in 1998, applied a convolutional neural network to image recognition for the first time with success, performed well in letter recognition, and greatly promoted the development of deep learning. In 2012, Alex Krizhevsky et al. at the University of Toronto presented the AlexNet network structure, a milestone in convolutional-neural-network-based image processing that drew broad attention to convolutional neural networks. Currently, CNN-based object detectors fall into two categories. The first is the single-stage (one-stage) detector, which does not search candidate regions separately but extracts features directly from the network to predict object categories and regress positions — roughly, "one step in place"; common one-stage algorithms include YOLO, SSD and RetinaNet. The second is the two-stage detector, which works in two steps: first acquire candidate regions (pre-selected boxes that may contain objects to be detected), then classify and regress positions through a convolutional neural network using the candidate-region features; common two-stage algorithms include R-CNN, Faster R-CNN and Mask R-CNN. One-stage algorithms have a speed advantage over two-stage algorithms, while two-stage algorithms have a precision advantage. With continuing optimization, however, both the accuracy and the speed of target detection methods have improved greatly.
Small target detection is widely used as part of target detection in images taken of large scenes or at long distances, for example houses in unmanned aerial vehicle aerial images or flowers in scenic portraits. There are two common official definitions of a small target. One comes from the international organization SPIE: a small target is one whose area is less than 80 pixels in a 256×256 image, i.e., less than 0.12% of the image. The other is in absolute size: by the COCO dataset definition, targets smaller than 32×32 pixels are small targets. Existing convolutional-neural-network-based target detection algorithms work well on general datasets, but small targets usually occupy a small portion of the image with weak or even absent edge features, and because of limited resolution and semantic information, deep-learning-based small target detection performs poorly on conventional detection datasets. Small target detection is therefore a broad and important research direction in target detection, and many experts and scholars have proposed optimization methods for it. Current ideas for improving small target detection mainly include data enhancement, feature fusion, exploiting context information, suitable training methods, generating denser anchors, and applying generative adversarial networks to amplify features before detection.
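For concreteness, the two size definitions above can be written as simple predicates (a minimal sketch; the function names are ours, not from the patent or the cited definitions):

```python
def is_small_spie(target_area_px: float, img_w: int, img_h: int) -> bool:
    # SPIE definition: target area below 80 pixels in a 256x256 image,
    # i.e. below 0.12% of the image area, applied here proportionally.
    return target_area_px / (img_w * img_h) < 80.0 / (256 * 256)

def is_small_coco(box_w: float, box_h: float) -> bool:
    # COCO absolute-size definition: area smaller than 32x32 pixels.
    return box_w * box_h < 32 * 32
```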
It is well known that object detectors trained on the COCO dataset detect small objects far less well than large objects. Several reasons cause this. 1) The nature of network feature extraction. Deep-learning-based detection networks typically use a CNN as the feature extraction tool. During feature extraction, the CNN keeps deepening to obtain features with strong semantic information and a large receptive field, but the feature map keeps shrinking as the network deepens, so information about small areas — the small target feature information — is hard to pass to the later stages of the detector; small target features are hard to extract and tend to vanish, and small target detection performance naturally suffers. 2) Unbalanced target size distribution in detection datasets. The COCO dataset has far more large targets than small ones, so deep-learning-based detection networks are friendlier to large target detection and have difficulty adapting to targets of different sizes. 3) The network loss function. When positive and negative samples are selected, the loss function is not friendly to small targets.
High resolution and strong semantic information are both vital to detection accuracy, for target detection in general and for small target detection in particular. But, like the proverbial fish and bear's paw, high resolution and strong semantic information cannot both be had. Obtaining features with strong semantic information requires more downsampling; as downsampling increases, the receptive field keeps expanding and the semantic information of each pixel in the deep feature map keeps strengthening. Yet keeping the feature map at high resolution — ensuring the small target's features do not vanish and its edge information stays clear — requires less downsampling, in which case the size of the receptive field cannot be guaranteed, and more computation and memory are needed.
In the Mask R-CNN network, the ResNet-FPN outputs [P2, P3, P4, P5, P6] serve as inputs to the RPN layer. Feature map P6 is obtained from P5 by max-pooling downsampling with stride 2 and is used only to obtain region proposals in the RPN layer, for the largest anchor size, 512×512. Direct downsampling enlarges the receptive field but introduces no learnable parameters and loses part of the spatial resolution, which is detrimental to accurately locating large targets and identifying small targets.
In order to solve one of the above problems, referring to FIG. 1, an embodiment of the present invention provides a small target recognition method, including the steps of:
Step S110: acquiring a target image input by a convolutional neural network model, wherein the target image contains a small target to be identified;
step S120: extracting characteristics of small targets in the target image to obtain an original characteristic diagram of the target image;
Step S130: extracting the features of the original feature map of the target image through a channel attention mechanism to obtain a semantic feature map of the important small target semantic feature, extracting the features of the original feature map of the target image through a space attention mechanism to obtain a position feature map of the important small target position feature, and fusing the original feature map, the semantic feature map and the position feature map of the target image to obtain a feature map after the target image fusion;
Step S140: classifying small targets in the target image based on the feature images after the target image fusion to obtain the recognition result of the small targets.
This embodiment is based on the Mask R-CNN convolutional neural network and improves its feature extraction part while keeping the Mask R-CNN backbone essentially unchanged. First, a three-way feature attention fusion module, the Sandwich Attention Feature Fusion Module (SAFF module), is designed to strengthen the semantic information of shallow features and the resolution of deep features, and feature fusion is carried out effectively in combination with the FPN network structure, improving the accuracy of target classification and position regression, particularly for small targets. Second, a new neural network stage is created in the backbone through dilated convolution, alleviating the loss of resolution that occurs when a pooling layer enlarges the receptive field.
In steps S110 to S120, the target image containing the small target information to be detected is input into a trained convolutional neural network model. The model extracts the information of the target image, retaining features relevant to the small target to be detected and discarding irrelevant features, to obtain the original feature map.
Because the semantic and position information of the original feature map is very weak, the convolutional neural network cannot effectively distinguish small targets in the target image from it alone, so this information must be enhanced. In step S130, features of the original feature map are extracted through a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, and through a spatial attention mechanism to obtain a position feature map emphasizing small target position features; these two maps make up exactly for the weak semantic and position information of the original feature map. Finally, the original, semantic and position feature maps are fused into a feature map enhanced in resolution, semantic information and position information.
In step S140, the convolutional neural network classifies the small targets in the target image based on the fused feature map and obtains the recognition result of the small targets.
In this embodiment, an initial extraction of information from the target image yields an original feature map with high resolution but weak semantic and position information. A channel attention mechanism extracts features from the original feature map to obtain a semantic feature map emphasizing small target semantic features, and a spatial attention mechanism extracts features to obtain a position feature map emphasizing small target position features. Fusing the three maps lets each compensate for the others' weaknesses, producing a fused feature map enhanced in resolution, semantic information and position information; the convolutional neural network then classifies and identifies small targets based on the fused map, improving the model's recognition accuracy on small targets.
In some alternative embodiments, the target image may be convolved with dilated convolution during feature extraction. Dilated convolution injects holes into the standard convolution kernel to enlarge the receptive field, allowing the convolution output to cover a wider range of information and avoiding unnecessary loss of part of the information.
Taking the Mask R-CNN model as an example, the implementation of the above embodiment is described in detail below.
The embodiment of the invention improves the feature extraction backbone of the two-stage detector Mask R-CNN, focusing on two aspects. First, a three-way feature attention fusion module, the Sandwich Attention Feature Fusion Module (SAFF module), is designed to strengthen the semantic information of shallow features and the resolution of deep features, and feature fusion is carried out effectively in combination with the Feature Pyramid Network (FPN) structure, improving the accuracy of target classification and position regression, particularly for small targets. Second, the loss of resolution that occurs when pooling enlarges the receptive field is alleviated by creating a new stage in the backbone with dilated convolution.
1. SAFF module
Referring to Figs. 2 and 3, the Sandwich Attention Feature Fusion Module (SAFF module) is a three-way feature attention fusion module formed by alternately stacking two channel attention mechanisms and one spatial attention mechanism. Its purpose is to strengthen the semantic information of shallow features, improve the resolution of deep features, and so optimize small target detection performance.
The channel attention mechanism automatically learns the importance of each feature channel, then strengthens useful features and weakens features of little use to the current task accordingly. First, a feature map X (of dimensions H×W×C, where H and W denote the height and width of the feature map and C the number of input channels) is input, and two 1×1×C channel features are obtained by global average pooling (GAP) and max pooling respectively. Briefly, the GAP operation pools each feature map globally, compressing its global information into a single real number that, to some extent, has a receptive field over the whole map and can give each channel a direct class meaning; GAP also greatly reduces network parameters. Max pooling divides the feature map into blocks and takes the maximum of each block, extracting the relatively strongest information in the feature map and discarding weaker information before the next layer.
Then the feature matrices obtained by GAP and max pooling are added element-wise, and the sum is fed to the next convolution layers. The first convolution kernel is 1×1×C/r (r is the channel compression ratio), compressing the channels from C to C/r and reducing the dimensionality of the feature map. The second convolution kernel is 1×1×C, restoring the number of channels to the original size. Using 1×1 convolution kernels for this dimension reduction and restoration is in effect a linear recombination of information across channels, enabling inter-channel interaction. A ReLU nonlinear activation layer between the two convolutions adds nonlinearity and improves the network's expressive capacity. Finally, a new scaled feature map is obtained through a sigmoid nonlinear activation function.
The processed output feature can be formulated as:
C(x)=δ(Conv(σ(Conv(GAP(x)+MaxPool(x)))))
where σ denotes the ReLU nonlinear activation function and δ the sigmoid nonlinear activation function.
The output C(x) is multiplied element-wise with the original input feature map through a long skip connection to obtain the fused feature F1(x), expressed by the following formula, where X denotes the original input feature map:
F1(x)=C(x)×X
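As a concrete illustration, this channel attention branch can be sketched in PyTorch as follows (a minimal sketch, not the patented implementation; the module name and the default compression ratio r=16 are our assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: C(x) = sigmoid(Conv(ReLU(Conv(GAP(x) + MaxPool(x))))), F1(x) = C(x) * X."""

    def __init__(self, channels: int, r: int = 16):  # r: channel compression ratio (assumed default)
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling -> 1x1xC
        self.gmp = nn.AdaptiveMaxPool2d(1)  # max pooling over the whole map -> 1x1xC
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),  # 1x1xC/r kernel: compress channels
            nn.ReLU(inplace=True),                              # nonlinearity between the two convolutions
            nn.Conv2d(channels // r, channels, kernel_size=1),  # 1x1xC kernel: restore channels
            nn.Sigmoid(),                                       # scaled channel weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.fc(self.gap(x) + self.gmp(x))  # element-wise sum of the two pooled features
        return c * x                            # F1(x): long skip connection multiplies C(x) onto X
```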
The spatial attention mechanism differs from the channel attention mechanism in that it focuses on strengthening the position information of features, complementing channel attention. First, pooling is performed along the channel dimension in two different ways, global average pooling (GAP) and global max pooling (GMP), producing two feature maps of the same dimensions. The two maps are then concatenated (concat) along the channel dimension to obtain a combined feature map. A convolution then reduces its dimensionality, and a sigmoid nonlinear activation function yields a spatial matrix carrying the spatial attention weights. Finally, this spatial attention matrix is multiplied element-wise with the original feature map to obtain a new, spatially strengthened feature layer F2(x), expressed by the following formulas.
S(x)=δ(Conv(GAP(x);MaxPool(x)))
F2(x)=S(x)×X
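The spatial branch admits an equally small sketch (again an assumed reading; the 7×7 convolution kernel size is our choice, since the text does not fix one):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: S(x) = sigmoid(Conv([GAP_c(x); GMP_c(x)])), F2(x) = S(x) * X."""

    def __init__(self, kernel_size: int = 7):  # kernel size assumed, not specified in the patent
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)  # reduce 2 channels to 1
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)   # channel-wise global average pooling -> 1xHxW
        mx, _ = x.max(dim=1, keepdim=True)  # channel-wise global max pooling -> 1xHxW
        s = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # concat, convolve, activate
        return s * x                        # F2(x): spatial weights applied to X
```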
After the feature map passes through the Sandwich Attention Feature Fusion Module, the final feature map X′, enhanced by both channel and spatial information, can be expressed as:
X′=X×(2F1(x)+F2(x))
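Combining the two branches per this formula gives the full module (a sketch assuming the ChannelAttention and SpatialAttention classes above; the factor 2 folds in the module's two channel attention branches):

```python
import torch
import torch.nn as nn

class SAFF(nn.Module):
    """Sandwich Attention Feature Fusion: X' = X * (2*F1(x) + F2(x))."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.ca(x)           # channel-enhanced feature F1(x)
        f2 = self.sa(x)           # spatially enhanced feature F2(x)
        return x * (2 * f1 + f2)  # fused, sandwich-style output X'
```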
2. Stage6 module
Referring to Fig. 4, Mask R-CNN uses ResNet as the backbone network for feature extraction, consisting of 5 stages. Each stage consists of a different number of convolutional layers and includes two kinds of residual block: the convolutional block, whose input and output dimensions differ, which can change the dimensions of the network but cannot be stacked in series; and the identity block, whose input and output dimensions are the same, which can be stacked in series to deepen the network. The feature maps output by Stage1-5 are C1, C2, C3, C4 and C5, with sizes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image respectively.
The feature pyramid network (FPN) fuses features at multiple scales. In Mask R-CNN, after an input picture is processed by the ResNet-FPN feature extraction network, P2, P3, P4, P5 and P6 are obtained as the effective feature layers from which the RPN derives prediction boxes. Each P layer handles a single scale: anchors of the five scales {32×32, 64×64, 128×128, 256×256, 512×512} correspond to the five feature layers {P2, P3, P4, P5, P6} respectively, and each feature layer handles candidate boxes of the three aspect ratios 1:1, 1:2 and 2:1.
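This scale-to-level assignment can be written out as a small configuration (a sketch of the mapping just described; the variable and function names are ours):

```python
# Anchor scale handled by each FPN level, and the aspect ratios every level shares.
ANCHOR_SCALE_PER_LEVEL = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
ASPECT_RATIOS = (1.0, 0.5, 2.0)  # width:height ratios 1:1, 1:2, 2:1

def anchor_sizes(level: str) -> list[tuple[float, float]]:
    """Return (width, height) anchor sizes for one FPN level."""
    s = ANCHOR_SCALE_PER_LEVEL[level]
    return [(s * r ** 0.5, s / r ** 0.5) for r in ASPECT_RATIOS]  # keep area ~ s*s per ratio
```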
Compared with ResNet, the network of this embodiment retains Stage1-5 of ResNet101 and adds a Stage6 module composed of two basic dilated identity blocks and one dilated convolutional block. After the feature map passes through the Stage6 module, the output is P6, with a size 1/64 of the original image. In the original network, P6 is designed specifically for the RPN, does not participate in the later structure of the network, handles candidate boxes of size 512×512, and is obtained directly from P5 by downsampling. A feature layer obtained by downsampling enlarges the receptive field and lets the convolution see more information, but in reducing dimensions it keeps only what it deems important and loses part of the information; using pooling to enlarge the receptive field therefore reduces resolution at the cost of some information, which affects the accuracy of the final position regression to a certain extent. Using dilated convolution, the receptive field can be enlarged without pooling, so that the convolution output covers a wider range of information and unnecessary loss of part of the information is avoided.
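A minimal sketch of a dilated identity block of the kind Stage6 stacks (our reading of the design; the channel widths, the dilation rate of 2 and the bottleneck layout are assumptions — Table 1 gives the actual configuration):

```python
import torch
import torch.nn as nn

class DilatedIdentityBlock(nn.Module):
    """Residual identity block whose 3x3 convolution is dilated, enlarging the
    receptive field without the resolution loss of pooled downsampling."""

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # padding == dilation keeps the spatial size unchanged for a 3x3 kernel
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.body(x))  # identity skip: input/output dims match
```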
The details of the DSAFF-Net feature extraction backbone are shown in Table 1:
TABLE 1
With reference to FIG. 5, a specific implementation of an embodiment of the present invention is described in further detail below:
1) Dataset preparation: the MS COCO public dataset, which includes a training set and a test set, is used as the subject.
2) Build the two-stage target detection network model:
2.1) Input a picture; the ResNet101 + FPN + SAFF module + Stage6 form the new feature extraction network structure, yielding the feature layers after feature extraction;
2.2) Extract candidate boxes through the RPN network. The feature layer handling candidate boxes of size 512×512 is obtained directly from P5 by downsampling in the original network; in the present invention it is obtained from the Stage6 module composed of dilated identity blocks and a dilated convolutional block;
2.3) Scale the target candidate boxes obtained in 2.2) to a uniform size using ROIAlign (a region feature aggregation method);
2.4) Classify these regions of interest (ROIs), perform bounding box regression, and generate masks (FCN operations within each ROI).
In a second aspect, referring to FIG. 6, an embodiment of the present invention provides a small target detection system, including an input module 210, a feature extraction module 220, a feature fusion module 230, and an output module 240, wherein:
The input module 210 is used to acquire a target image input to the convolutional neural network model, wherein the target image contains a small target to be identified; the feature extraction module 220 is used to extract features of the small target in the target image to obtain an original feature map; the feature fusion module 230 is used to extract features of the original feature map through a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, extract features of the original feature map through a spatial attention mechanism to obtain a position feature map emphasizing small target position features, and fuse the original feature map, the semantic feature map and the position feature map to obtain a fused feature map of the target image; and the output module 240 is used to classify small targets in the target image based on the fused feature map to obtain the recognition result of the small targets.
In this embodiment, the feature extraction module 220 performs an initial extraction of information from the target image; the resulting original feature map has high resolution but weak semantic and position information. The feature fusion module 230 extracts features of the original feature map with a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, extracts features with a spatial attention mechanism to obtain a position feature map emphasizing small target position features, and fuses the original, semantic and position feature maps so that each compensates for the others' weaknesses, enhancing resolution, semantic information and position information. The convolutional neural network classifies and identifies small targets in the target image based on the fused feature map, improving the recognition accuracy of the model on small targets.
In addition, referring to FIG. 7, the present application also provides a computer device 301, including: a memory 310, a processor 320, and a computer program 311 stored on the memory 310 and executable on the processor; when executing the computer program 311, the processor 320 implements:
such as the small object recognition method described above.
The processor 320 and the memory 310 may be connected by a bus or other means.
The memory 310, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 310 may include high-speed random access memory and non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 310 may optionally include memory located remotely from the processor, connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the small target recognition method of the above embodiment are stored in the memory and, when executed by the processor, perform the small target recognition method of the above embodiment, for example method steps S110 to S140 in FIG. 1 described above.
In addition, referring to FIG. 8, the present application also provides a computer-readable storage medium 401 storing computer-executable instructions 410, the computer-executable instructions 410 being for performing:
such as the small object recognition method described above.
The computer-readable storage medium 401 stores computer-executable instructions 410 which, when executed by a processor or controller, for example a processor in the above electronic device embodiment, cause the processor to perform the small target recognition method of the above embodiment, for example method steps S110 to S140 in FIG. 1 described above.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of data such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired data and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any data delivery media.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
Claims (7)
1. A method for identifying a small target, comprising the steps of:
acquiring a target image input to a convolutional neural network model, wherein the target image contains a small target to be identified;
extracting features of the small target in the target image to obtain an original feature map of the target image;
extracting features of the original feature map of the target image through a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, extracting features of the original feature map of the target image through a spatial attention mechanism to obtain a position feature map emphasizing small target position features, and fusing the original feature map of the target image, the semantic feature map and the position feature map to obtain a fused feature map of the target image;
classifying the small target in the target image based on the fused feature map to obtain a recognition result of the small target; wherein extracting the features of the original feature map of the target image through the channel attention mechanism to obtain the semantic feature map emphasizing small target semantic features comprises the steps of:
performing global average pooling and max pooling on the original feature map of the target image to obtain an average feature matrix and a maximum feature matrix, and adding the two element-wise;
convolving the element-wise sum with a 1×1×C/r convolution kernel and activating with a ReLU nonlinear activation layer;
convolving the ReLU-activated result with a 1×1×C convolution kernel and activating with a sigmoid nonlinear activation function to obtain the semantic feature map emphasizing small target semantic features;
and wherein extracting the features of the original feature map of the target image through the spatial attention mechanism to obtain the position feature map emphasizing small target position features comprises the steps of:
performing global average pooling and max pooling on the original feature map of the target image to obtain an average feature matrix and a maximum feature matrix, and concatenating the two;
convolving the concatenated result and activating with a sigmoid nonlinear activation function to obtain the position feature map emphasizing small target position features.
2. The small target recognition method according to claim 1, wherein fusing the original feature map, the semantic feature map and the position feature map of the target image to obtain the fused feature map of the target image comprises the steps of:
multiplying the original feature map with the semantic feature map element-wise and doubling the product, multiplying the original feature map with the position feature map element-wise, and adding the products element-wise;
multiplying the element-wise sum with the original feature map element-wise to obtain the fused feature map of the target image.
3. The small target recognition method according to claim 1, wherein the features of the small target in the target image are extracted by convolution or dilated convolution.
4. The small target recognition method according to claim 1, wherein the recognition result of the small target comprises: marking the category of the small target with text, marking the position of the small target with a rectangular box, and masking small targets of different categories with different colors.
5. A small target detection system, comprising:
The input module, used for acquiring a target image input to a convolutional neural network model, wherein the target image contains a small target to be identified;
The feature extraction module, used for extracting features of the small target in the target image to obtain an original feature map of the target image;
The feature fusion module, used for extracting features of the original feature map of the target image through a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, extracting features of the original feature map of the target image through a spatial attention mechanism to obtain a position feature map emphasizing small target position features, and fusing the original feature map, the semantic feature map and the position feature map of the target image to obtain a fused feature map of the target image; wherein extracting the features of the original feature map through the channel attention mechanism to obtain the semantic feature map emphasizing small target semantic features comprises the steps of:
performing global average pooling and max pooling on the original feature map of the target image to obtain an average feature matrix and a maximum feature matrix, and adding the two element-wise; convolving the element-wise sum with a 1×1×C/r convolution kernel and activating with a ReLU nonlinear activation layer; convolving the ReLU-activated result with a 1×1×C convolution kernel and activating with a sigmoid nonlinear activation function to obtain the semantic feature map emphasizing small target semantic features;
and wherein extracting the features of the original feature map through the spatial attention mechanism to obtain the position feature map emphasizing small target position features comprises the steps of:
performing global average pooling and max pooling on the original feature map of the target image to obtain an average feature matrix and a maximum feature matrix, and concatenating the two; convolving the concatenated result and activating with a sigmoid nonlinear activation function to obtain the position feature map emphasizing small target position features;
And the output module, used for classifying the small target in the target image based on the fused feature map to obtain a recognition result of the small target.
6. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
the small object recognition method according to any one of claims 1 to 4.
7. A computer-readable storage medium storing computer-executable instructions for performing:
the small object recognition method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111225109.1A CN114037839B (en) | 2021-10-21 | 2021-10-21 | Small target identification method, system, electronic equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111225109.1A CN114037839B (en) | 2021-10-21 | 2021-10-21 | Small target identification method, system, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114037839A (en) | 2022-02-11
CN114037839B (en) | 2024-06-28
Family
ID=80135631
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111225109.1A Active CN114037839B (en) | 2021-10-21 | 2021-10-21 | Small target identification method, system, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114037839B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114677661B (en) * | 2022-03-24 | 2024-10-18 | 智道网联科技(北京)有限公司 | Road side identification recognition method and device and electronic equipment |
CN114926718A (en) * | 2022-05-20 | 2022-08-19 | 南京理工大学 | Low-small slow target detection method with fusion of adjacent scale weight distribution characteristics |
CN115631196B (en) * | 2022-12-20 | 2023-03-10 | 杭州太美星程医药科技有限公司 | Image segmentation method, model training method, device, equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298402A (en) * | 2019-07-01 | 2019-10-01 | 国网内蒙古东部电力有限公司 | A kind of small target deteection performance optimization method |
CN111259930B (en) * | 2020-01-09 | 2023-04-25 | 南京信息工程大学 | General target detection method of self-adaptive attention guidance mechanism |
CN112418345B (en) * | 2020-12-07 | 2024-02-23 | 深圳小阳软件有限公司 | Method and device for quickly identifying small targets with fine granularity |
- 2021-10-21: CN application CN202111225109.1A — patent CN114037839B (en), status: Active
Non-Patent Citations (2)
Title |
---|
Object detection with multi-scale feature map fusion; Jiang Wentao, Zhang Chi, Zhang Shengchong, Liu Wanjun; Journal of Image and Graphics; 2019-12-31; Vol. 24, No. 11 *
Research on pedestrian detection in real scenes with multi-layer convolutional features; Wu Pengying, Zhang Jianming, Peng Jian, Lu Chaoquan; CAAI Transactions on Intelligent Systems; 2019-12-31; No. 2 *
Also Published As
Publication number | Publication date |
---|---|
CN114037839A (en) | 2022-02-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107341517B (en) | Multi-scale small object detection method based on deep learning inter-level feature fusion | |
CN114037839B (en) | Small target identification method, system, electronic equipment and medium | |
CN108509978B (en) | Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion | |
CN111008562B (en) | Human-vehicle target detection method with feature map depth fusion | |
CN108345892B (en) | Method, device and equipment for detecting significance of stereo image and storage medium | |
CN108537824B (en) | Feature map enhanced network structure optimization method based on alternating deconvolution and convolution | |
CN111833273B (en) | Semantic boundary enhancement method based on long-distance dependence | |
CN108305260B (en) | Method, device and equipment for detecting angular points in image | |
CN104239867A (en) | License plate locating method and system | |
CN111460980A (en) | Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion | |
CN110008900B (en) | Method for extracting candidate target from visible light remote sensing image from region to target | |
CN110909741A (en) | Vehicle re-identification method based on background segmentation | |
CN110458864A (en) | Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects | |
CN113255837A (en) | Improved CenterNet network-based target detection method in industrial environment | |
Zhou et al. | Building segmentation from airborne VHR images using Mask R-CNN | |
CN111353544A (en) | Improved Mixed Pooling-Yolov 3-based target detection method | |
CN116129291A (en) | Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device | |
CN114170570A (en) | Pedestrian detection method and system suitable for crowded scene | |
Song et al. | MEB-YOLO: An Efficient Vehicle Detection Method in Complex Traffic Road Scenes. | |
CN115346071A (en) | Image classification method and system for high-confidence local feature and global feature learning | |
Tsutsui et al. | Distantly supervised road segmentation | |
CN114596592B (en) | Pedestrian re-identification method, system, equipment and computer readable storage medium | |
CN109284752A (en) | A kind of rapid detection method of vehicle | |
Perreault et al. | Rn-vid: A feature fusion architecture for video object detection | |
CN112347967A (en) | Pedestrian detection method fusing motion information in complex scene |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |