CN115588126A - GAM, CARAFE and SnIoU fused vehicle target detection method - Google Patents
- Publication number
- CN115588126A (application CN202211194651.XA)
- Authority
- CN
- China
- Prior art keywords
- gam
- substep
- sniou
- carafe
- loss
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/32—Normalisation of the pattern dimensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a vehicle target detection method fusing GAM, CARAFE and SnIoU, which comprises the following steps: converting the data set into a format suitable for YOLOv5 training, performing data enhancement on the images, adding GAM modules to the YOLOv5 backbone and neck networks, replacing nearest-neighbor interpolation with CARAFE upsampling in the neck network, and finally using SnIoU-Loss as the loss function of the algorithm to detect various vehicles under a surveillance viewing angle. The invention incorporates a GAM attention mechanism in the backbone network and combines the attention module with content-aware feature reassembly (CARAFE) upsampling in the neck network: the reassembly kernels are predicted from low-level content information, features are reassembled within a predefined neighborhood, and global weight information is learned from features of different scales and fused efficiently. A loss function is also proposed to improve the training convergence process and its results. The invention addresses the problems of occluded and blurred targets and poor detection accuracy in the prior art.
Description
Technical Field
The invention relates to the technical field of vehicle target detection, in particular to a vehicle target detection method fusing GAM, CARAFE and SnIoU.
Background
With the gradual improvement of people's living standards, the number of vehicles used for daily transportation keeps increasing, and effectively managing vehicles on the road has become a huge challenge. Vehicle target detection is a key foundational technology for building smart cities and has long attracted the attention of many researchers at home and abroad. There are two main approaches: one extracts target features with traditional machine learning methods, such as HOG, and feeds them into a classifier such as a Support Vector Machine (SVM) or AdaBoost for classification and detection; the other uses deep learning techniques (such as convolutional neural networks) to automatically complete feature extraction and detection of the target. Compared with picture data sets, video data sets suffer from blurred objects, mutual occlusion and similar conditions, which makes it difficult for existing methods to correctly extract target information and to localize and classify targets correctly.
Many related inventions using YOLOv5 for efficient target detection have been disclosed. For example, publication No. CN114882393A, published on August 9, 2022, discloses a method for detecting wrong-way driving and traffic accident events on roads based on target detection, which comprises the following steps: S1, acquiring original data; S2, obtaining samples from the original data, and annotating the vehicle positions and vehicle types in the frame pictures of the samples; S3, obtaining a training set and a validation set through data processing; S4, improving the data enhancement method and activation function of the original YOLOv5 to obtain a YOLOv5-better model; S5, inputting the training set and the validation set into the YOLOv5-better model and obtaining the weight file of the improved model through training; S6, loading the obtained weight file into the YOLOv5-better model, running the test set to obtain vehicle information, and inputting the vehicle information into DeepSORT to obtain the serial number (id) and type of each vehicle; and S7, inputting the position information corresponding to each id in the video frames into a logical judgment algorithm to decide whether the vehicle is driving the wrong way or an accident has occurred. This method for detecting wrong-way driving and traffic accident events based on target detection is suitable for intelligent video analysis work. However, such models often cannot exploit more global semantic information for object detection.
Disclosure of Invention
1. Technical problem to be solved by the invention
In order to overcome the problems in the prior art, the invention provides a vehicle target detection method fusing GAM, CARAFE and SnIoU. The method mines positional and local information through the GAM module, extracts target semantic information better through CARAFE upsampling, and finally makes the model converge faster and more accurately through SnIoU-Loss; it can thus dynamically generate adaptive kernels from high-level semantic information and improve the accuracy of the prediction boxes through the regression vector angle.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the invention discloses a vehicle target detection method fusing GAM, CARAFE and SnIoU, which comprises the following steps:
step 1, acquiring a vehicle target detection data set;
step 2, preprocessing the data set picture;
step 3, constructing a detection network, and introducing a GAM module into the backbone network and neck network of YOLOv5;
step 4, using CARAFE to replace nearest-neighbor interpolation upsampling in the neck network of YOLOv5;
step 5, replacing the Loss function with SnIoU-Loss;
and 6, inputting the training set into an improved YOLOv5-GCS model for training to obtain a weight file, and then testing the test set by using the weight file to obtain a final result.
Further, the vehicle target detection data set needs to be converted into a format suitable for YOLOv5 training in step 1, and data enhancement is performed in step 2 according to the characteristics of the data set.
Further, in step 3, [-1, 1, GAMAttention, [n, n]] is added to the backbone and neck networks, i.e. at layers 9, 19, 23 and 27, in the YOLOv5 source code file yolov5s.yaml, where -1 means the input of the layer comes from the output of the previous layer, 1 is the number of repetitions of the layer, and [n, n] means the numbers of input and output channels are both n (the number of channels differs between layers).
Furthermore, the GAM module introduced in step 3 consists of two sequential parts, the first being channel attention and the second spatial attention, specifically:
Substep 1: the input vector F_1 obtained from the previous convolutional layer passes through the channel attention M_c and becomes the vector F_2;
Further, in step 4, the original nn.Upsample nearest-neighbor interpolation in the neck network, i.e. at layers 12 and 16 in the YOLOv5 source code file yolov5s.yaml, is replaced with CARAFE upsampling.
Further, the modified upsampling method has the following flow:
Substep 1: for an input feature map of shape H × W × C, the number of channels is first compressed to C_m by a 1 × 1 convolution;
Substep 2: the shape of the upsampling kernel to be predicted is set to σH × σW × k_up × k_up, where σ is the upsampling magnification; for the input feature map compressed in substep 1, the upsampling kernel is predicted by a convolutional layer of size k_encoder × k_encoder whose number of input channels is C_m and number of output channels is σ² × k_up × k_up; the channel dimension is then unfolded in the spatial dimension to obtain an upsampling kernel of shape σH × σW × k_up²;
Substep 3: the upsampling kernel obtained in substep 2 is normalized with softmax so that the weights of each convolution kernel sum to 1;
Substep 4: each position in the output feature map is mapped back to the input feature map, the k_up × k_up region centered on it is taken out, and its dot product with the upsampling kernel predicted for that point in substep 2 gives the output value; different channels at the same position share the same upsampling kernel.
Further, the process of calculating SnIoU-Loss in step 5 is as follows:
Substep 1: calculate the angle loss Λ;
Substep 2: calculate the distance loss Δ according to the angle loss Λ;
Substep 3: calculate the shape loss Ω;
Substep 4: calculate SnIoU-Loss;
where IoU is the ratio of the intersection to the union of the prediction box and the ground-truth box, and n is a constant.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following remarkable effects:
In existing surveillance practice, monitoring equipment is laid out as a flat on-site list, so the monitored positions are not intuitive; equipment operation is rather traditional and requires frequent manual inspection. At today's huge industrial scale, manual inspection suffers from low efficiency and low reliability. With the advent of large-scale data sets, the difficulty of feature engineering for machine learning keeps increasing, whereas deep learning models can learn the intrinsic features of data from the data itself. The proposed method mines positional and local information through the GAM module, extracts target semantic information better through CARAFE upsampling, and finally makes the model converge faster and more accurately through SnIoU-Loss, thereby completing road vehicle target detection. With the improved YOLOv5-GCS model, fast and highly accurate vehicle target detection can be achieved, which is of great significance for field applications in maintaining road safety and relieving traffic congestion.
Drawings
FIG. 1 is a schematic diagram of a model structure according to the present invention;
FIG. 2 is a schematic diagram of the channel attention submodule of the present invention;
FIG. 3 is a schematic diagram of the spatial attention submodule of the present invention;
FIG. 4 is a schematic diagram of the CARAFE upsampling module of the present invention;
FIG. 5 is a schematic diagram of an angle relationship between a real frame and a predicted frame according to the present invention;
FIG. 6 is a schematic diagram of the intersection over union (IoU) between the ground-truth box and the prediction box in the present invention;
FIG. 7 is a confusion matrix thermodynamic diagram of a model of the invention;
FIG. 8 is a flow chart of the detection method of the present invention.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples.
Example 1
The vehicle target detection model proposed in this embodiment has the structure shown in fig. 1. With reference to fig. 8, the method of this embodiment for detecting vehicle targets by fusing GAM, CARAFE and SnIoU comprises the following specific steps:
Step 1: acquire a vehicle target detection data set close to the surveillance viewing angle
The UA-DETRAC vehicle data set, filmed mainly from road overpasses in Beijing and Tianjin, is used and converted into a format suitable for YOLOv5 training (a conversion sketch is given below).
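For reference, the following is a minimal sketch of converting UA-DETRAC-style XML annotations into YOLO txt labels. The frame size (960 × 540), the XML field names, the class mapping and the file paths are illustrative assumptions and are not taken from the patent.

```python
# Minimal sketch: UA-DETRAC-style XML annotations -> YOLO txt labels.
# Assumptions (not from the patent): 960x540 frames; one XML file per sequence with
# <frame num=...><target_list><target><box left= top= width= height=/>
# <attribute vehicle_type=.../></target>...; class names/paths are illustrative.
import os
import xml.etree.ElementTree as ET

IMG_W, IMG_H = 960, 540                                   # assumed frame size
CLASSES = {"car": 0, "van": 1, "bus": 2, "others": 3}     # assumed class mapping

def convert_sequence(xml_path, label_dir):
    os.makedirs(label_dir, exist_ok=True)
    root = ET.parse(xml_path).getroot()
    for frame in root.iter("frame"):
        lines = []
        for target in frame.iter("target"):
            box = target.find("box")
            attr = target.find("attribute")
            vtype = attr.get("vehicle_type", "others") if attr is not None else "others"
            cls = CLASSES.get(vtype, 3)
            left, top = float(box.get("left")), float(box.get("top"))
            w, h = float(box.get("width")), float(box.get("height"))
            # YOLO format: class cx cy w h, all normalized to [0, 1]
            cx, cy = (left + w / 2) / IMG_W, (top + h / 2) / IMG_H
            lines.append(f"{cls} {cx:.6f} {cy:.6f} {w / IMG_W:.6f} {h / IMG_H:.6f}")
        frame_id = int(frame.get("num"))
        with open(os.path.join(label_dir, f"img{frame_id:05d}.txt"), "w") as f:
            f.write("\n".join(lines))
```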
Step 2: preprocess the data set pictures
Data enhancement is performed according to the characteristics of the data set.
Substep 1: HSV data enhancement is applied first, taking weather, lighting and similar conditions into account.
Substep 2: considering the driving direction of vehicles, horizontal flipping of the pictures is used.
Substep 3: four pictures are randomly cropped and then stitched onto a single picture as training data (mosaic augmentation), which enriches the picture backgrounds; stitching four pictures together also effectively increases the batch size. A sketch of these augmentations is given below.
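As referenced above, here is a minimal sketch of these augmentations (HSV jitter, horizontal flip and a simple four-picture mosaic) using OpenCV and NumPy; the gain values and mosaic output size are illustrative assumptions, not settings specified by the patent.

```python
# Minimal augmentation sketch. Assumptions: boxes given as [cls, cx, cy, w, h] normalized
# to [0, 1]; gain values and the 640px mosaic canvas are illustrative, not the patent's.
import random
import cv2
import numpy as np

def hsv_jitter(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly scale hue/saturation/value to simulate weather and lighting changes."""
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] * r[0]) % 180
    hsv[..., 1:] = np.clip(hsv[..., 1:] * r[1:], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

def horizontal_flip(img, boxes):
    """Flip image left-right; only the normalized center x of each box changes."""
    boxes = boxes.copy()
    boxes[:, 1] = 1.0 - boxes[:, 1]
    return img[:, ::-1].copy(), boxes

def simple_mosaic(imgs, size=640):
    """Stitch four images into one canvas (labels must be remapped accordingly)."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    half = size // 2
    for img, (y, x) in zip(random.sample(imgs, 4),
                           [(0, 0), (0, half), (half, 0), (half, half)]):
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas
```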
Step 3, constructing a detection network, and introducing GAM modules into a backbone network and a neck network of YOLOv5
The kernel weights of a CNN are shared, i.e. all convolution kernels in the same layer share one set of identical weights, so the function that detects features is the same at different positions of an image; when an object in the image is translated, the corresponding feature-map response is translated identically. Moreover, since the feature map is compressed by the pooling step, the final result of the CNN is the same as before the translation.
The positional information of vehicles under a surveillance viewing angle is essentially fixed, and GAM is a global attention mechanism that improves the performance of deep neural networks by reducing information dispersion and amplifying global interactive representations, capturing important features across all three dimensions: channel, spatial width and spatial height.
In this embodiment, in the YOLOv5 source code file yolov5s.yaml, a GAM module [-1, 1, GAMAttention, [n, n]] is added at layers 9, 19, 23 and 27 of the backbone and neck networks, where -1 means the input of the layer comes from the output of the previous layer, 1 is the number of repetitions of the layer, and [n, n] means the numbers of input and output channels are both n (the number of channels differs between layers); the GAM attention allows the model to pay more attention to particular regions. The GAM module consists of two sequential parts, the first being channel attention and the second spatial attention, described in detail below:
Substep 1: the input vector F_1 obtained from the previous convolutional layer passes through the channel attention M_c and becomes the vector F_2, calculated as:
F_2 = M_c(F_1) ⊗ F_1
where ⊗ denotes element-wise multiplication of co-located elements; the function M_c is illustrated in fig. 2 and is computed as
M_c(F_1) = σ(MLP(P(F_1))) = σ(P′(W_1(W_0(P(F_1)))))
where P is the channel permutation that moves the first dimension to the last, W_0 and W_1 are fully-connected weight matrices, P′ is the inverse permutation that restores the tensor after P, and σ is the Sigmoid activation function.
Substep 2: the input vector F_2 passes through the spatial attention M_s and becomes the vector F_3, calculated as:
F_3 = M_s(F_2) ⊗ F_2
where ⊗ denotes element-wise multiplication of co-located elements; the function M_s is illustrated in fig. 3 and is computed as
M_s(F_2) = σ(BN(Conv_7×7(BN(Conv_7×7(F_2)))))
where Conv_7×7 denotes a convolution with a 7 × 7 kernel, BN is batch normalization, and σ is the Sigmoid activation function.
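A minimal PyTorch sketch of the GAM attention described above is given below. The module name GAMAttention, the reduction ratio, the ReLU inside the MLP, and the assumption that the input and output channel counts are equal are implementation choices for illustration, not details fixed by the patent.

```python
# Minimal sketch of GAM attention (channel attention followed by spatial attention).
# The reduction ratio `rate` is an assumption; the patent's yaml entry [n, n] implies
# out_channels == channels, which this sketch assumes as well.
import torch
import torch.nn as nn

class GAMAttention(nn.Module):
    def __init__(self, channels, out_channels, rate=4):
        super().__init__()
        # Channel attention M_c: permutation P, MLP (W_0, W_1), inverse permutation P'.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // rate),
            nn.ReLU(inplace=True),
            nn.Linear(channels // rate, channels),
        )
        # Spatial attention M_s: Conv7x7 -> BN -> Conv7x7 -> BN (per the formula above).
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // rate, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels // rate),
            nn.Conv2d(channels // rate, out_channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # M_c: move channels to the last dimension (P), apply the MLP, restore (P'), Sigmoid.
        perm = x.permute(0, 2, 3, 1).reshape(b, -1, c)       # P
        att_c = self.channel_mlp(perm).reshape(b, h, w, c)   # W_1(W_0(.))
        att_c = torch.sigmoid(att_c.permute(0, 3, 1, 2))     # P' and sigma
        f2 = x * att_c                                        # F_2 = M_c(F_1) (x) F_1
        att_s = torch.sigmoid(self.spatial(f2))              # M_s(F_2)
        return f2 * att_s                                     # F_3 = M_s(F_2) (x) F_2
```

In a yolov5s.yaml configuration such a module would be registered and inserted as [-1, 1, GAMAttention, [n, n]] at the layers listed above.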
Step 4 replace nearest neighbor interpolation upsampling using CARAFE in the neck network
The upsampling operation can be expressed as the dot product of an upsampling kernel at each position with the pixels of the corresponding neighborhood in the input feature map, which is called feature reassembly. The CARAFE upsampling operation has a larger receptive field during reassembly, can guide the reassembly process according to the input features, better expresses the semantic information extracted by the preceding convolution operations, and achieves a better sampling effect on blurred vehicles.
In this embodiment, in the YOLOv5 source code file yolov5s.yaml, the original nn.Upsample nearest-neighbor interpolation is replaced by CARAFE upsampling in the neck network, i.e. at layers 12 and 16. The flow of the modified upsampling method is as follows:
Substep 1: feature-map channel compression
For an input feature map of shape H × W × C (the output feature map of the previous layer), the number of channels is first compressed to C_m with a 1 × 1 convolution; the main purpose of this step is to reduce the computation of the subsequent steps.
Substep 2: content encoding and upsampling kernel prediction
Assume the upsampling kernel size is k_up × k_up (a larger upsampling kernel means a larger receptive field but more computation). In this embodiment a different upsampling kernel is desired for each position of the output feature map, so the shape of the upsampling kernel to be predicted is set to σH × σW × k_up × k_up, where σ is the upsampling magnification.
For the input feature map compressed in substep 1, the upsampling kernel is predicted by a convolutional layer of size k_encoder × k_encoder whose number of input channels is C_m and number of output channels is σ² × k_up × k_up; the channel dimension is then unfolded in the spatial dimension to obtain an upsampling kernel of shape σH × σW × k_up².
Substep 3: upsampling kernel normalization
The upsampling kernel obtained in substep 2 is normalized with softmax so that the weights of each convolution kernel sum to 1.
Substep 4: feature reassembly
Each position in the output feature map is mapped back to the input feature map, the k_up × k_up region centered on it is taken out, and its dot product with the upsampling kernel predicted for that point in substep 2 gives the output value. Different channels at the same position share the same upsampling kernel.
The specific structure of the up-sampling kernel prediction and the feature reorganization is shown in fig. 4.
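A minimal PyTorch sketch of the CARAFE upsampling described in substeps 1–4 is given below. The default hyper-parameters (c_mid, k_encoder, k_up) are illustrative assumptions, and this naive implementation trades memory for clarity rather than matching any optimized CARAFE kernel.

```python
# Minimal sketch of CARAFE upsampling following substeps 1-4 above.
# Default hyper-parameters are illustrative assumptions, not values fixed by the patent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    def __init__(self, channels, scale=2, c_mid=64, k_encoder=3, k_up=5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # Substep 1: compress channels to C_m with a 1x1 convolution.
        self.compress = nn.Conv2d(channels, c_mid, kernel_size=1)
        # Substep 2: predict sigma^2 * k_up^2 kernel values per input position.
        self.encoder = nn.Conv2d(c_mid, (scale * k_up) ** 2,
                                 kernel_size=k_encoder, padding=k_encoder // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # Substep 2: kernel prediction, unfolded to sigma*H x sigma*W output positions.
        kernels = self.encoder(self.compress(x))          # (B, s^2*k^2, H, W)
        kernels = F.pixel_shuffle(kernels, s)             # (B, k^2, sH, sW)
        # Substep 3: softmax so each k_up x k_up kernel sums to 1.
        kernels = F.softmax(kernels, dim=1)
        # Substep 4: take the k_up x k_up neighborhood of each input position ...
        patches = F.unfold(x, kernel_size=k, padding=k // 2)   # (B, C*k^2, H*W)
        patches = patches.view(b, c, k * k, h, w)
        # ... map every output position back to its nearest input position, then take
        # the dot product with that position's kernel; all channels share the kernel.
        patches = F.interpolate(
            patches.view(b, c * k * k, h, w), scale_factor=s, mode="nearest"
        ).view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)     # (B, C, sH, sW)
```

For example, CARAFE(channels=256, scale=2) could stand in for nn.Upsample(scale_factor=2, mode='nearest') at the corresponding neck layers.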
Step 5, replacing the Loss function with SnIoU-Loss
Where vehicles are dense, prediction boxes belonging to different but closely spaced vehicles may be suppressed during NMS processing. Introducing the vector regression angle of SIoU-Loss makes the prediction boxes converge faster and avoids such erroneous NMS suppression, and fusing the exponent n further improves accuracy.
Substep 1 calculation of the angular loss
The model first tries to bring the prediction box onto the horizontal X axis or the vertical Y axis of the ground-truth box (whichever is closer), and then continues the approach along that axis. During convergence, if α ≤ π/4 the model tries to minimize α first, otherwise it minimizes β = π/2 − α. The angle-cost calculation process is shown in fig. 5,
where c_h is the height difference between the center points of the ground-truth box and the prediction box, σ is the distance between the two center points, and arcsin(c_h/σ) equals the angle α.
Substep 2 calculating distance loss
In view of the angle loss defined above, the distance loss is redefined:
wherein:
here (c_w, c_h) are the width difference and height difference between the center points of the ground-truth box and the prediction box, and (c_w2, c_h2) are the width and height of the minimum enclosing rectangle of the ground-truth box and the prediction box.
Substep 3 calculating the shape loss
The shape loss is defined as follows:
wherein:
Here (w, h) and (w_gt, h_gt) are the width and height of the prediction box and the ground-truth box respectively, and θ controls the degree of attention paid to the shape loss. To avoid paying excessive attention to the shape loss and thereby reducing the movement of the prediction box, the invention uses a genetic algorithm to compute θ, which is close to 4; the θ parameter is therefore defined in the range [2, 6].
Substep 4 calculating SnIoU-Loss
Here IoU, illustrated in fig. 6, is the ratio of the intersection to the union of the prediction box and the ground-truth box.
n is typically set to 3, which enlarges the gradient and accelerates convergence.
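Because the SnIoU-Loss formula itself appears only as a figure in the original text, the sketch below should be read as one plausible reading rather than the patent's exact definition: the angle, distance and shape terms follow the standard SIoU formulation, and the placement of the exponent n on the IoU term is an assumption.

```python
# Hedged sketch of an SIoU-style loss with an extra exponent n (SnIoU-Loss reading).
# The composition loss = 1 - IoU**n + (distance + shape)/2 is an assumption; the
# angle/distance/shape terms follow the standard SIoU formulation.
import math
import torch

def sniou_loss(pred, gt, theta=4.0, n=3, eps=1e-7):
    """pred, gt: (N, 4) boxes as (x1, y1, x2, y2)."""
    # IoU: intersection over union of prediction and ground-truth boxes.
    x1 = torch.max(pred[:, 0], gt[:, 0]); y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2]); y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # Center offsets (c_w, c_h) and minimum enclosing rectangle (c_w2, c_h2).
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    cw, ch = (cxg - cxp).abs(), (cyg - cyp).abs()
    cw2 = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch2 = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])

    # Substep 1: angle loss, Lambda = 1 - 2*sin^2(arcsin(c_h / sigma) - pi/4).
    sigma = torch.sqrt(cw ** 2 + ch ** 2) + eps
    angle = 1 - 2 * torch.sin(torch.arcsin(ch / sigma) - math.pi / 4) ** 2

    # Substep 2: distance loss with gamma = 2 - Lambda.
    gamma = 2 - angle
    dist = (1 - torch.exp(-gamma * (cw / (cw2 + eps)) ** 2)) + \
           (1 - torch.exp(-gamma * (ch / (ch2 + eps)) ** 2))

    # Substep 3: shape loss over width and height, weighted by theta.
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    ww = (wp - wg).abs() / torch.max(wp, wg)
    wh = (hp - hg).abs() / torch.max(hp, hg)
    shape = (1 - torch.exp(-ww)) ** theta + (1 - torch.exp(-wh)) ** theta

    # Substep 4: combine; applying n to the IoU term is an assumption.
    return 1 - iou ** n + (dist + shape) / 2
```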
Step 6: input the UA-DETRAC training set into the improved YOLOv5-GCS model for training to obtain a weight file, then test the UA-DETRAC test set with this weight file to obtain the final result.
The official test set classifies all vehicles into a single class. Faster R-CNN uses region proposal generation as its first stage, which usually yields higher recognition accuracy, but generating a large number of candidate boxes greatly reduces the execution efficiency of the system. RN-VID resolves ambiguity with the help of optical flow and future frames, but computing optical flow and using future information makes online detection difficult. CenterNet detects an object as a point and cannot completely draw the detection box when vehicles occlude each other severely. YOLOv5 balances speed and accuracy, allowing real-time target detection. Table 1 lists the average precision of these models at an IoU threshold of 0.7; the proposed model reaches the highest precision for the single class.
Table 1: Average precision of different models
Traffic policies differ for different vehicle driving behaviors, so the categories were divided into car, van, bus and others according to the original labels, and the model was retrained. To show the effect of the model more intuitively and effectively, a confusion-matrix thermodynamic diagram of the model's test results and an ablation table for some of the modules are given below. FIG. 7 is the confusion-matrix thermodynamic diagram of the test results of the YOLOv5-GCS model proposed by the invention, in which the color depth of each square represents the prediction rate. As can be seen from fig. 7, because "others" covers various vehicles, such as police cars, construction vehicles and trucks, and the number of car samples is very large, such vehicles are easily misclassified as cars. The rest of the confusion matrix shows that the proposed model has good prediction performance. Table 2 shows that the detection accuracy of the proposed model is superior to that of the original YOLOv5 model, further demonstrating its superiority.
Table 2: Ablation experiments for the modules
The invention uses the object detection model YOLOv5, which has remained popular and been continuously iterated since its release, to replace traditional models such as HOG and DPM, so that vehicles can be detected efficiently under different backgrounds and angles. The invention combines a GAM attention mechanism with the backbone network of YOLOv5, which amplifies global dimensional interaction features while reducing information dispersion. In the neck network the attention module (GAM) is combined with content-aware feature reassembly upsampling (CARAFE): the reassembly kernel is predicted from the content information of the lower layer, features are reassembled within a predefined neighborhood, and global weight information is then learned from features of different scales and fused efficiently. In addition, the invention proposes the loss function SnIoU-Loss based on SIoU-Loss; by introducing the regression vector angle and the exponent n, it greatly helps the training convergence process and its results, so as to solve the problem of insufficient detection accuracy in the prior art, finally yielding the improved YOLOv5-GCS detection algorithm. With this deep-learning-based vehicle target detection method, the problems of occluded and blurred targets and poor detection accuracy in the prior art can be solved.
The present invention and its embodiments have been described above schematically and without limitation; what is shown in the drawings is only one of the embodiments of the invention, and the actual structure is not limited thereto. Therefore, if a person skilled in the art, having received this teaching and without departing from the spirit of the invention, devises structural modes and embodiments similar to this technical solution without inventive effort, they shall fall within the protection scope of the invention.
Claims (7)
1. A method for detecting a vehicle target by fusing GAM, CARAFE and SnIoU is characterized by comprising the following steps:
step 1, acquiring a vehicle target detection data set;
step 2, preprocessing a data set picture;
step 3, constructing a detection network, and introducing a GAM module into the backbone network and neck network of YOLOv5;
step 4, using CARAFE to replace nearest-neighbor interpolation upsampling in the neck network of YOLOv5;
step 5, replacing the Loss function with SnIoU-Loss;
and 6, inputting the training set into an improved YOLOv5-GCS model for training to obtain a weight file, and then testing the test set by using the weight file to obtain a final result.
2. The vehicle target detection method fusing GAM, CARAFE and SnIoU according to claim 1, characterized in that: the vehicle target detection data set in step 1 needs to be converted into a format suitable for YOLOv5 training, and in step 2 data enhancement is performed according to the characteristics of the data set.
3. The method of claim 2 for vehicle object detection fusing GAM, CARAFE and SnIoU, wherein: in step 3, [-1, 1, GAMAttention, [n, n]] is added to the backbone and neck networks, i.e. at layers 9, 19, 23 and 27, in the YOLOv5 source code file yolov5s.yaml, where -1 means the input of the layer comes from the output of the previous layer, 1 is the number of repetitions of the layer, and [n, n] means the numbers of input and output channels are both n (the number of channels differs between layers).
4. The vehicle target detection method fusing GAM, CARAFE and SnIoU according to claim 3, wherein: the GAM module introduced in step 3 consists of two sequential parts, the first being channel attention and the second spatial attention, specifically:
substep 1: the input vector F_1 obtained from the previous convolutional layer passes through the channel attention M_c and becomes the vector F_2;
5. The method for vehicle object detection fusing GAM, CARAFE and SnIoU according to any one of claims 1-4, wherein: in step 4, the original nn.Upsample nearest-neighbor interpolation in the neck network, i.e. at layers 12 and 16 in the YOLOv5 source code file yolov5s.yaml, is replaced with CARAFE upsampling.
6. The vehicle target detection method fusing GAM, CARAFE and SnIoU according to claim 5, wherein: the modified upsampling method has the following flow:
substep 1: for an input feature map of shape H × W × C, the number of channels is first compressed to C_m by a 1 × 1 convolution;
substep 2: the shape of the upsampling kernel to be predicted is set to σH × σW × k_up × k_up, where σ is the upsampling magnification; for the input feature map compressed in substep 1, the upsampling kernel is predicted by a convolutional layer of size k_encoder × k_encoder whose number of input channels is C_m and number of output channels is σ² × k_up × k_up; the channel dimension is then unfolded in the spatial dimension to obtain an upsampling kernel of shape σH × σW × k_up²;
substep 3: the upsampling kernel obtained in substep 2 is normalized with softmax so that the weights of each convolution kernel sum to 1;
substep 4: each position in the output feature map is mapped back to the input feature map, the k_up × k_up region centered on it is taken out, and its dot product with the upsampling kernel predicted for that point in substep 2 gives the output value; different channels at the same position share the same upsampling kernel.
7. The vehicle target detection method fusing GAM, CARAFE and SnIoU according to claim 6, wherein: the SnIoU-Loss calculation process in the step 5 is as follows:
substep 1: calculate the angle loss Λ;
substep 2: calculate the distance loss Δ according to the angle loss Λ;
substep 3: calculate the shape loss Ω;
substep 4: calculate SnIoU-Loss;
where IoU is the ratio of the intersection to the union of the prediction box and the ground-truth box, and n is a constant.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211194651.XA CN115588126A (en) | 2022-09-29 | 2022-09-29 | GAM, CARAFE and SnIoU fused vehicle target detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211194651.XA CN115588126A (en) | 2022-09-29 | 2022-09-29 | GAM, CARAFE and SnIoU fused vehicle target detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115588126A true CN115588126A (en) | 2023-01-10 |
Family
ID=84777936
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211194651.XA Pending CN115588126A (en) | 2022-09-29 | 2022-09-29 | GAM, CARAFE and SnIoU fused vehicle target detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115588126A (en) |
- 2022-09-29: application CN202211194651.XA filed in China; published as CN115588126A; status Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116681983A (en) * | 2023-06-02 | 2023-09-01 | 中国矿业大学 | Long and narrow target detection method based on deep learning |
CN116681983B (en) * | 2023-06-02 | 2024-06-11 | 中国矿业大学 | Long and narrow target detection method based on deep learning |
CN117351353A (en) * | 2023-10-16 | 2024-01-05 | 常熟理工学院 | Crop pest real-time detection method and device based on deep learning and computer storage medium |
CN117468084A (en) * | 2023-12-27 | 2024-01-30 | 浙江晶盛机电股份有限公司 | Crystal bar growth control method and device, crystal growth furnace system and computer equipment |
CN117468084B (en) * | 2023-12-27 | 2024-05-28 | 浙江晶盛机电股份有限公司 | Crystal bar growth control method and device, crystal growth furnace system and computer equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Saha et al. | Translating images into maps | |
CN107169421B (en) | Automobile driving scene target detection method based on deep convolutional neural network | |
CN110414418B (en) | Road detection method for multi-scale fusion of image-laser radar image data | |
CN110728200A (en) | Real-time pedestrian detection method and system based on deep learning | |
CN110298257B (en) | Driver behavior recognition method based on human body multi-part characteristics | |
CN115588126A (en) | GAM, CARAFE and SnIoU fused vehicle target detection method | |
CN110379020A (en) | A kind of laser point cloud painting methods and device based on generation confrontation network | |
CN114202743A (en) | Improved fast-RCNN-based small target detection method in automatic driving scene | |
CN115019043B (en) | Cross-attention mechanism-based three-dimensional object detection method based on image point cloud fusion | |
CN111814863A (en) | Detection method for light-weight vehicles and pedestrians | |
CN112070174A (en) | Text detection method in natural scene based on deep learning | |
CN112990065A (en) | Optimized YOLOv5 model-based vehicle classification detection method | |
CN114842085A (en) | Full-scene vehicle attitude estimation method | |
CN117975436A (en) | Three-dimensional target detection method based on multi-mode fusion and deformable attention | |
CN115273032A (en) | Traffic sign recognition method, apparatus, device and medium | |
CN114973199A (en) | Rail transit train obstacle detection method based on convolutional neural network | |
CN111666988A (en) | Target detection algorithm based on multi-layer information fusion | |
CN117111055A (en) | Vehicle state sensing method based on thunder fusion | |
CN117115690A (en) | Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement | |
CN115082869B (en) | Vehicle-road cooperative multi-target detection method and system for serving special vehicle | |
CN117058641A (en) | Panoramic driving perception method based on deep learning | |
CN116563825A (en) | Improved Yolov 5-based automatic driving target detection algorithm | |
CN116109649A (en) | 3D point cloud instance segmentation method based on semantic error correction | |
Nakano et al. | Detection of objects jumping in front of car using deep learning | |
Hadi et al. | Semantic instance segmentation in a 3D traffic scene reconstruction task |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||