TGC-YOLOv5: An Enhanced YOLOv5 Drone Detection Model Based on Transformer, GAM & CA Attention Mechanism
Figures
Figure 1. SUAV-DATA.
Figure 2. Drone templates.
Figure 3. (a) Length and width distribution statistics of large, medium, and small drones in the SUAV-DATA dataset; (b) statistics of the number of large, medium, and small drones in the SUAV-DATA dataset.
Figure 4. Frame diagram of TGC-YOLOv5.
Figure 5. Feature enhancement performance of the ablation experimental models. NOTE: (a) depicts the original input image, while (b–e) show the heatmaps of head features obtained from models trained with four different algorithms (YOLOv5, YOLOv5 + Transformer, YOLOv5 + Transformer + GAM, and TGC-YOLOv5, respectively) for the same input image.
Figure 6. Frame of the Transformer Encoder Block.
Figure 7. Frame of the Global Attention Mechanism.
Figure 8. Frame of the Channel Attention Mechanism.
Figure 9. Frame of the Spatial Attention Mechanism.
Figure 10. Frame of the Coordinate Attention Mechanism.
Figure 11. Experimental results on SUAV-DATA. (a) Ablation experiment results; (b) parallel experiment results.
Figure 12. AP and FLOPs corresponding to different algorithms. NOTE: circle size is directly proportional to the corresponding algorithm's parameter count.
Figure 13. Comparison of the test results of the original YOLOv5 and TGC-YOLOv5 on the SUAV-DATA dataset.
Figure 14. The four types of pollution (light, fog, stain, and saturation). NOTE: each column represents five levels of a specific type of pollution, recorded as 1, 2, 3, 4, and 5.
Abstract
1. Introduction
1.1. Methods Based on Traditional Image Processing
1.2. Methods Based on Deep Learning
1.3. This Work
- (1) We provide a small target dataset, SUAV-DATA, consisting of 10,000 images capturing small drones from different angles and under complex background conditions. Some targets in these images are occluded, and annotations are provided for all drones.
- (2) We introduce a Transformer encoder module into YOLOv5, enhancing its capability to capture local information.
- (3) We incorporate a global attention mechanism (GAM) to reduce information diffusion between different layers and amplify global interactive features across dimensions. Additionally, we integrate a coordinate attention mechanism (CA) into the bottleneck part of C3, further enhancing the extraction of feature information for small targets.
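To make contribution (3) concrete, the sketch below illustrates the coordinate attention mechanism (Sections 3.5.1 and 3.5.2) in plain NumPy, following the formulation of Hou et al. This is an illustrative sketch only: the module's learned 1×1 convolutions are stood in for by random projection matrices, and the function name, shapes, and reduction default are our own assumptions, not the paper's code.

```python
import numpy as np

def coordinate_attention(x, reduction=8, rng=None):
    """Illustrative NumPy sketch of coordinate attention (Hou et al., 2021).

    x: feature map of shape (C, H, W).
    The 1x1 convolutions of the real module are replaced here by random
    projection matrices; in a trained network these are learned weights.
    Returns a reweighted feature map of shape (C, H, W).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    C, H, W = x.shape
    Cr = max(C // reduction, 1)  # reduced channel count

    # Coordinate information embedding (Sec. 3.5.1):
    # pool along each spatial axis separately, preserving the other.
    z_h = x.mean(axis=2)  # (C, H): average over width for each row
    z_w = x.mean(axis=1)  # (C, W): average over height for each column

    # Coordinate attention generation (Sec. 3.5.2):
    # concatenate, reduce channels, apply nonlinearity, then split back.
    y = np.concatenate([z_h, z_w], axis=1)   # (C, H + W)
    W1 = rng.standard_normal((Cr, C)) * 0.1  # stand-in for 1x1 conv F1
    f = np.maximum(W1 @ y, 0.0)              # (Cr, H + W), ReLU
    f_h, f_w = f[:, :H], f[:, H:]            # split into the two directions

    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    W_h = rng.standard_normal((C, Cr)) * 0.1  # stand-in for 1x1 conv F_h
    W_w = rng.standard_normal((C, Cr)) * 0.1  # stand-in for 1x1 conv F_w
    g_h = sigmoid(W_h @ f_h)  # (C, H): attention weights along height
    g_w = sigmoid(W_w @ f_w)  # (C, W): attention weights along width

    # Reweight: y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j)
    return x * g_h[:, :, None] * g_w[:, None, :]
```

Because the two pooled vectors keep positional information along one axis each, the resulting gates encode where (not just which channels) the salient responses lie, which is the property that helps localize small drone targets when the module is placed inside the C3 bottleneck.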
2. Dataset
3. Framework
3.1. Overview of YOLOv5
3.2. TGC-YOLOv5
3.3. Transformer Encoder Block
3.4. Global Attention Mechanism
3.5. Coordinate Attention Mechanism
3.5.1. Coordinate Information Embedding
3.5.2. Coordinate Attention Generation
4. Results and Discussion
4.1. Determination of the TGC Method’s Position
4.2. Ablation and Parallel Experiments Results
4.3. Comparison of Detection Performance for Different Sizes of Drones
4.4. Experimental Results on Public Drone Datasets
4.5. Robustness Analysis
4.6. Comparison of Small Object Detection Algorithms
4.7. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Monika; Bansal, D.; Passi, A. Image Forgery Detection and Localization Using Block Based and Key-Point Based Feature Matching Forensic Investigation. Wirel. Pers. Commun. 2022, 127, 2823–2839.
- Gangadharan, K.; Kumari, G.R.N.; Dhanasekaran, D.; Malathi, K. Automatic detection of plant disease and insect attack using effta algorithm. Int. J. Adv. Comput. Sci. Appl. 2020, 11.
- Huynh, H.X.; Truong, B.Q.; Nguyen Thanh, K.T.; Truong, D.Q. Plant identification using new architecture convolutional neural networks combine with replacing the red of color channel image by vein morphology leaf. Vietnam J. Comput. Sci. 2020, 7, 197–208.
- Zebari, D.A.; Zeebaree, D.Q.; Abdulazeez, A.M.; Haron, H.; Hamed, H.N.A. Improved threshold based and trainable fully automated segmentation for breast cancer boundary and pectoral muscle in mammogram images. IEEE Access 2020, 8, 203097–203116.
- Srivastava, H.B.; Kumar, V.; Verma, H.; Sundaram, S. Image Pre-processing Algorithms for Detection of Small/Point Airborne Targets. Def. Sci. J. 2009, 59, 166–174.
- Jie, W.; Feng, Z.; Wang, L. High Recognition Ratio Image Processing Algorithm of Micro Electrical Components in Optical Microscope. TELKOMNIKA (Telecommun. Comput. Electron. Control) 2014, 12, 911–920.
- Saha, D. Development of Enhanced Weed Detection System with Adaptive Thresholding, K-Means and Support Vector Machine; South Dakota State University: Brookings, SD, USA, 2019.
- Kang, X.; Song, B.; Guo, J.; Du, X.; Guizani, M. A self-selective correlation ship tracking method for smart ocean systems. Sensors 2019, 19, 821.
- Tang, M.; Liang, K.; Qiu, J. Small insulator target detection based on multi-feature fusion. IET Image Process. 2023, 17, 1520–1533.
- Nebili, B.; Khellal, A.; Nemra, A. Histogram encoding of sift based visual words for target recognition in infrared images. In Proceedings of the 2021 International Conference on Recent Advances in Mathematics and Informatics (ICRAMI), Tebessa, Algeria, 21–22 September 2021; pp. 1–6.
- Zhou, Y.; Tang, Y.; Zou, X.; Wu, M.; Tang, W.; Meng, F.; Zhang, Y.; Kang, H. Adaptive Active Positioning of Camellia oleifera Fruit Picking Points: Classical Image Processing and YOLOv7 Fusion Algorithm. Appl. Sci. 2022, 12, 12959.
- Khalid, S.; Oqaibi, H.M.; Aqib, M.; Hafeez, Y. Small Pests Detection in Field Crops Using Deep Learning Object Detection. Sustainability 2023, 15, 6815.
- Chu, Q.; Ouyang, W.; Li, H.; Wang, X.; Liu, B.; Yu, N. Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4836–4845.
- Xu, S.; Savvaris, A.; He, S.; Shin, H.-s.; Tsourdos, A. Real-time implementation of YOLO+ JPDA for small scale UAV multiple object tracking. In Proceedings of the 2018 International Conference on Unmanned Aircraft Systems (ICUAS), Dallas, TX, USA, 12–15 June 2018; pp. 1336–1341.
- Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1222–1230.
- Cao, G.; Xie, X.; Yang, W.; Liao, Q.; Shi, G.; Wu, J. Feature-fused SSD: Fast detection for small objects. In Proceedings of the Ninth International Conference on Graphic and Image Processing (ICGIP 2017), Qingdao, China, 14–16 October 2017; pp. 381–388.
- Liang, H.; Yang, J.; Shao, M. FE-RetinaNet: Small Target Detection with Parallel Multi-Scale Feature Enhancement. Symmetry 2021, 13, 950.
- Luo, X.; Wu, Y.; Wang, F. Target detection method of UAV aerial imagery based on improved YOLOv5. Remote Sens. 2022, 14, 5063.
- Nath, V.; Chattopadhyay, C.; Desai, K. On enhancing prediction abilities of vision-based metallic surface defect classification through adversarial training. Eng. Appl. Artif. Intell. 2023, 117, 105553.
- Tang, Y.; Huang, Z.; Chen, Z.; Chen, M.; Zhou, H.; Zhang, H.; Sun, J. Novel visual crack width measurement based on backbone double-scale features for improved detection automation. Eng. Struct. 2023, 274, 115158.
- Que, Y.; Dai, Y.; Ji, X.; Leung, A.K.; Chen, Z.; Tang, Y.; Jiang, Z. Automatic classification of asphalt pavement cracks using a novel integrated generative adversarial networks and improved VGG model. Eng. Struct. 2023, 277, 115406.
- He, H.; Chen, Q.; Xie, G.; Yang, B.; Li, S.; Zhou, B.; Gu, Y. A Lightweight Deep Learning Model for Real-time Detection and Recognition of Traffic Signs Images Based on YOLOv5. In Proceedings of the 2022 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), Suzhou, China, 17–18 November 2022; pp. 206–212.
- Wei, J.; Wang, Q.; Song, X.; Zhao, Z. The Status and Challenges of Image Data Augmentation Algorithms. J. Phys. Conf. Ser. 2023, 2456, 012041.
- Chen, C.; Liu, M.-Y.; Tuzel, O.; Xiao, J. R-CNN for small object detection. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; pp. 214–230.
- Huang, Z.; Wang, F.; You, H.; Hu, Y. STC-Det: A Slender Target Detector Combining Shadow and Target Information in Optical Satellite Images. Remote Sens. 2021, 13, 4183.
- Ju, M.; Luo, J.; Zhang, P.; He, M.; Luo, H. A simple and efficient network for small target detection. IEEE Access 2019, 7, 85771–85781.
- Liu, S.; Wu, R.; Qu, J.; Li, Y. HPN-SOE: Infrared Small Target Detection and Identification Algorithm Based on Heterogeneous Parallel Networks with Similarity Object Enhancement. IEEE Sens. J. 2023, 23, 13797–13809.
- Zhan, J.; Hu, Y.; Cai, W.; Zhou, G.; Li, L. PDAM–STPNNet: A small target detection approach for wildland fire smoke through remote sensing images. Symmetry 2021, 13, 2260.
- Chen, J.; Hong, H.; Song, B.; Guo, J.; Chen, C.; Xu, J. MDCT: Multi-Kernel Dilated Convolution and Transformer for One-Stage Object Detection of Remote Sensing Images. Remote Sens. 2023, 15, 371.
- Li, W.; Wang, Q.; Gao, S. PF-YOLOv4-Tiny: Towards Infrared Target Detection on Embedded Platform. Intell. Autom. Soft Comput. 2023, 37, 921–938.
- Chen, L.; Yang, Y.; Wang, Z.; Zhang, J.; Zhou, S.; Wu, L. Underwater Target Detection Lightweight Algorithm Based on Multi-Scale Feature Fusion. J. Mar. Sci. Eng. 2023, 11, 320.
- Li, X.; Diao, W.; Mao, Y.; Gao, P.; Mao, X.; Li, X.; Sun, X. OGMN: Occlusion-guided multi-task network for object detection in UAV images. ISPRS J. Photogramm. Remote Sens. 2023, 199, 242–257.
- Liu, X.; Wang, C.; Liu, L. Research on pedestrian detection model and compression technology for UAV images. Sensors 2022, 22, 9171.
- Shen, Y.; Liu, D.; Zhang, F.; Zhang, Q. Fast and accurate multi-class geospatial object detection with large-size remote sensing imagery using CNN and Truncated NMS. ISPRS J. Photogramm. Remote Sens. 2022, 191, 235–249.
- Xu, X.; Zhao, S.; Xu, C.; Wang, Z.; Zheng, Y.; Qian, X.; Bao, H. Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks. Drones 2023, 7, 250.
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
- Wang, H.; Xu, Y.; He, Y.; Cai, Y.; Chen, L.; Li, Y.; Sotelo, M.A.; Li, Z. YOLOv5-Fog: A multiobjective visual detection algorithm for fog driving scenes based on improved YOLOv5. IEEE Trans. Instrum. Meas. 2022, 71, 1–12.
- Dai, G.; Hu, L.; Fan, J.; Yan, S.; Li, R. A Deep Learning-Based Object Detection Scheme by Improving YOLOv5 for Sprouted Potatoes Datasets. IEEE Access 2022, 10, 85416–85428.
- Wang, L.; Cao, Y.; Wang, S.; Song, X.; Zhang, S.; Zhang, J.; Niu, J. Investigation into recognition algorithm of Helmet violation based on YOLOv5-CBAM-DCN. IEEE Access 2022, 10, 60622–60632.
- Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662.
- Wang, Z.; Zhang, H.; Lin, Z.; Tan, X.; Zhou, B. Prohibited Items Detection in Baggage Security Based on Improved YOLOv5. In Proceedings of the 2022 IEEE 2nd International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 16–18 June 2022; pp. 20–25.
- Yang, R.; Li, W.; Shang, X.; Zhu, D.; Man, X. KPE-YOLOv5: An Improved Small Target Detection Algorithm Based on YOLOv5. Electronics 2023, 12, 817.
- Hong, W.; Ma, Z.; Ye, B.; Yu, G.; Tang, T.; Zheng, M. Detection of Green Asparagus in Complex Environments Based on the Improved YOLOv5 Algorithm. Sensors 2023, 23, 1562.
- Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H. Swin-Transformer-Enabled YOLOv5 with Attention Mechanism for Small Object Detection on Satellite Images. Remote Sens. 2022, 14, 2861.
- Xiao, Z.; Sun, E.; Yuan, F.; Peng, J.; Liu, J. Detection Method of Damaged Camellia Oleifera Seeds Based on YOLOv5-CB. IEEE Access 2022, 10, 126133–126141.
- Ren, J.; Wang, Z.; Zhang, Y.; Liao, L. YOLOv5-R: Lightweight real-time detection based on improved YOLOv5. J. Electron. Imaging 2022, 31, 033033.
- Qi, J.; Liu, X.; Liu, K.; Xu, F.; Guo, H.; Tian, X.; Li, M.; Bao, Z.; Li, Y. An improved YOLOv5 model based on visual attention mechanism: Application to recognition of tomato virus disease. Comput. Electron. Agric. 2022, 194, 106780.
- Zhu, Y.; Li, S.; Du, W.; Du, Y.; Liu, P.; Li, X. Identification of table grapes in the natural environment based on an improved Yolov5 and localization of picking points. Precis. Agric. 2023, 24, 1333–1354.
- Li, Y.; Bai, X.; Xia, C. An Improved YOLOV5 Based on Triplet Attention and Prediction Head Optimization for Marine Organism Detection on Underwater Mobile Platforms. J. Mar. Sci. Eng. 2022, 10, 1230.
- Dai, J.; Zhang, X. Automatic image caption generation using deep learning and multimodal attention. Comput. Animat. Virtual Worlds 2022, 33, e2072.
- Pawełczyk, M.; Wojtyra, M. Real world object detection dataset for quadcopter unmanned aerial vehicle detection. IEEE Access 2020, 8, 174394–174409.
- Zheng, Y.; Chen, Z.; Lv, D.; Li, Z.; Lan, Z.; Zhao, S. Air-to-air visual detection of micro-uavs: An experimental evaluation of deep learning. IEEE Robot. Autom. Lett. 2021, 6, 1020–1027.
- Walter, V.; Vrba, M.; Saska, M. On training datasets for machine learning-based visual relative localization of micro-scale UAVs. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10674–10680.
- Chen, Y.; Aggarwal, P.; Choi, J.; Kuo, C.-C.J. A deep learning approach to drone monitoring. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 686–691.
- Torralba, A.; Fergus, R.; Freeman, W.T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1958–1970.
- Dong, Z.; Wang, M.; Wang, Y.; Zhu, Y.; Zhang, Z. Object detection in high resolution remote sensing imagery based on convolutional neural networks with suitable object scale features. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2104–2114.
- Jocher, G.; Stoken, A.; Borovec, J.; Chaurasia, A.; Changyu, L.; Hogan, A.; Hajek, J.; Diaconu, L.; Kwon, Y.; Defretin, Y. ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 Models, AWS, Supervise.ly and YouTube Integrations. Zenodo 2021.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
- Kopuklu, O.; Kose, N.; Gunduz, A.; Rigoll, G. Resource efficient 3d convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
- Cao, J.; Zhang, J.; Huang, W. Traffic sign detection and recognition using multi-scale fusion and prime sample attention. IEEE Access 2020, 9, 3579–3591.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Zhang, R.; Shao, Z.; Huang, X.; Wang, J.; Wang, Y.; Li, D. Adaptive dense pyramid network for object detection in UAV imagery. Neurocomputing 2022, 489, 377–389.
- Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
- Chalavadi, V.; Jeripothula, P.; Datla, R.; Ch, S.B. mSODANet: A network for multi-scale object detection in aerial images using hierarchical dilated convolutions. Pattern Recognit. 2022, 126, 108548.
- Benz, U.C.; Hofmann, P.; Willhauck, G.; Lingenfelder, I.; Heynen, M. Multi-resolution, object-oriented fuzzy analysis of remote sensing data for GIS-ready information. ISPRS J. Photogramm. Remote Sens. 2004, 58, 239–258.
- Zhang, R.; Shao, Z.; Huang, X.; Wang, J.; Li, D. Object detection in UAV images via global density fused convolutional network. Remote Sens. 2020, 12, 3140.
Model | Precision | Recall | mAP | GFLOPs | Params (M)
---|---|---|---|---|---
TGC_C3-1 | 0.926 | 0.811 | 0.827 | 14.0 | 16.3
TGC_C3-2 | 0.934 | 0.817 | 0.832 | 18.5 | 17.2
TGC_C3-3 | 0.939 | 0.823 | 0.848 | 13.4 | 19.7
Model | Precision | Recall | mAP | FPS | GFLOPs | Params (M)
---|---|---|---|---|---|---
Faster_rcnn-r50_fpn | 0.801 | 0.588 | 0.801 | 31.5 | 210.6 | 43.8
RetinaNet | 0.820 | 0.625 | 0.820 | 55.4 | 205.5 | 36.2
SSD 300 | 0.767 | 0.557 | 0.767 | 47.5 | 384.7 | 33.6
YOLOv5s | 0.938 | 0.785 | 0.823 | 84.3 | 15.8 | 7.0
YOLOv5s-Transformer | 0.940 | 0.790 | 0.833 | 77.1 | 16.0 | 8.5
Y-T-CBAM | 0.925 | 0.785 | 0.831 | 83.7 | 14.3 | 9.1
Y-T-SE | 0.903 | 0.799 | 0.830 | 75.4 | 16.5 | 13.3
Y-T-NAM | 0.922 | 0.794 | 0.835 | 81.2 | 15.6 | 17.6
Y-T-CA | 0.934 | 0.788 | 0.835 | 72.9 | 17.1 | 8.1
Y-T-GAM | 0.948 | 0.790 | 0.837 | 85.0 | 14.0 | 20.0
TGC-YOLOv5 (Ours) | 0.939 | 0.823 | 0.848 | 86.5 | 13.4 | 19.7
Model | Small Precision | Small Recall | Small mAP | Medium Precision | Medium Recall | Medium mAP | Large Precision | Large Recall | Large mAP
---|---|---|---|---|---|---|---|---|---
YOLOv5s | 0.891 | 0.765 | 0.847 | 0.943 | 0.907 | 0.950 | 0.943 | 0.928 | 0.953
TGC-YOLOv5 | 0.902 | 0.817 | 0.880 | 0.976 | 0.915 | 0.965 | 0.935 | 0.947 | 0.962
Dataset | Model | Precision | Recall | mAP
---|---|---|---|---
Real-World | YOLOv5s | 0.957 | 0.919 | 0.966
Real-World | TGC-YOLOv5 | 0.959 | 0.936 | 0.975
Drone-dataset | YOLOv5s | 0.928 | 0.905 | 0.937
Drone-dataset | TGC-YOLOv5 | 0.933 | 0.916 | 0.951
Pollution | Level | SSD 300 | Faster_rcnn-r50_fpn | RetinaNet | YOLOv5s | TGC-YOLOv5 (Ours)
---|---|---|---|---|---|---
Light | 1 | 0.851 | 0.856 | 0.860 | 0.863 | 0.870
Light | 2 | 0.845 | 0.850 | 0.854 | 0.859 | 0.862
Light | 3 | 0.828 | 0.830 | 0.834 | 0.837 | 0.841
Light | 4 | 0.803 | 0.811 | 0.818 | 0.822 | 0.830
Light | 5 | 0.781 | 0.794 | 0.799 | 0.804 | 0.807
Fog | 1 | 0.843 | 0.847 | 0.852 | 0.861 | 0.865
Fog | 2 | 0.839 | 0.842 | 0.846 | 0.853 | 0.857
Fog | 3 | 0.826 | 0.829 | 0.831 | 0.833 | 0.836
Fog | 4 | 0.814 | 0.817 | 0.820 | 0.824 | 0.832
Fog | 5 | 0.813 | 0.816 | 0.814 | 0.819 | 0.823
Stain | 1 | 0.848 | 0.852 | 0.854 | 0.860 | 0.867
Stain | 2 | 0.842 | 0.847 | 0.852 | 0.863 | 0.865
Stain | 3 | 0.834 | 0.833 | 0.835 | 0.838 | 0.844
Stain | 4 | 0.735 | 0.739 | 0.742 | 0.748 | 0.752
Stain | 5 | 0.678 | 0.674 | 0.681 | 0.686 | 0.688
Saturation | 1 | 0.859 | 0.862 | 0.865 | 0.867 | 0.872
Saturation | 2 | 0.846 | 0.853 | 0.856 | 0.862 | 0.868
Saturation | 3 | 0.836 | 0.839 | 0.841 | 0.843 | 0.847
Saturation | 4 | 0.825 | 0.828 | 0.832 | 0.836 | 0.842
Saturation | 5 | 0.805 | 0.811 | 0.817 | 0.820 | 0.824
Model | mAP50 | mAP | GFLOPs | Params (M) |
---|---|---|---|---|
Faster-RCNN [64] | 0.310 | 0.172 | 118.8 | 41.2 |
Cascade ADPN [65] | 0.387 | 0.228 | 547.2 | 90.8 |
Cascade-RCNN [66] | 0.388 | 0.226 | 146.6 | 69.0 |
mSODANet [67] | 0.559 | 0.369 | 10.6 | 22.0 |
AdNet-SS [68] | 0.579 | 0.311 | 32.8 | 77.2 |
YOLOv5s | 0.537 | 0.317 | 16.3 | 7.04 |
YOLOv5m | 0.586 | 0.354 | 48.2 | 20.9 |
RetinaNet [68] | 0.443 | 0.227 | 35.7 | 36.4 |
Grid GDF [69] | 0.308 | 0.182 | 257.6 | 72.0 |
SABL [68] | 0.412 | 0.250 | 145.5 | 99.6 |
YOLOX-s | 0.535 | 0.314 | 26.8 | 9.0 |
This work | 0.597 | 0.385 | 13.4 | 19.7 |
Data | Model | Precision | Recall | mAP | GFLOPs | Params (M)
---|---|---|---|---|---|---
Real Data | YOLOv5s | 0.923 | 0.865 | 0.925 | 15.8 | 7.0
Real Data | TGC-YOLOv5 | 0.938 | 0.886 | 0.936 | 19.5 | 13.5
Synthetic Data | YOLOv5s | 0.927 | 0.872 | 0.932 | 15.8 | 7.0
Synthetic Data | TGC-YOLOv5 | 0.946 | 0.877 | 0.945 | 19.5 | 13.5
Model | Precision | Recall | mAP | GFLOPs | Params (M)
---|---|---|---|---|---
YOLOv8s | 0.942 | 0.835 | 0.839 | 28.4 | 11.1
TGC-YOLOv8 | 0.957 | 0.848 | 0.860 | 27.7 | 21.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhao, Y.; Ju, Z.; Sun, T.; Dong, F.; Li, J.; Yang, R.; Fu, Q.; Lian, C.; Shan, P. TGC-YOLOv5: An Enhanced YOLOv5 Drone Detection Model Based on Transformer, GAM & CA Attention Mechanism. Drones 2023, 7, 446. https://doi.org/10.3390/drones7070446