Background
With the rapid development of deep learning, target detection, as an important research direction of computer vision, has greatly improved in both detection efficiency and detection accuracy. However, the detection effect of existing target detection methods is still unsatisfactory, and such methods cannot be applied to target detection under adverse conditions such as complicated image backgrounds, high environmental noise, low contrast and uneven illumination.
Taking the detection of train bottom parts as an example, train bottom parts are important train components and a necessary condition for train operation; to ensure safe operation, the components of an inbound train need to be routinely checked. Detection methods are generally divided into two kinds. The first is manual visual inspection of the important parts; however, with the rapid increase in the number of trains, long and monotonous manual inspection in the complex environment at the bottom of trains leads to visual fatigue, inattention or illusion, which easily causes missed detections and can affect the safe operation of trains.
Traditional target detection algorithms are mainly divided into three steps: region selection, feature extraction, and classifier-based classification. The first step, region selection, locates the position of the target. Since the target may appear at any position in the image and its size and aspect ratio are uncertain, the whole image is initially traversed with a sliding-window strategy, and different scales and aspect ratios need to be set. Although this exhaustive strategy covers all possible positions of the target, its disadvantages are also evident: the time complexity is too high and too many redundant windows are generated, which seriously affects the speed and performance of the subsequent feature extraction and classification. In the second step, feature extraction, it is not easy to design a robust feature because of factors such as the morphological diversity of targets, the diversity of illumination variation and the diversity of backgrounds, yet the accuracy of classification is directly affected by the quality of the extracted features. The third step is classification, in which the features extracted in the previous step are classified by a classifier, generally a support vector machine.
In summary, conventional target detection methods have several major problems: the region selection strategy based on the sliding window is untargeted, with high time complexity and redundant windows; manually designed features are not robust to diverse changes; and the collected photos differ greatly in image background, environmental noise, contrast and exposure, so target detection in multiple scenes is difficult to realize with a single type of image processing technique.
Disclosure of Invention
The invention provides a target detection method based on a deep neural network, which solves the problems of large calculation amount, long time consumption, poor generalization capability and low recognition precision of target detection in the prior art.
The invention provides a target detection method based on a deep neural network, which comprises the following steps:
step 1, obtaining a target detection object image set;
step 2, preprocessing the target detection object image set to obtain a data set, and constructing a training sample set according to the data set;
step 3, constructing a deep neural network, wherein the deep neural network comprises a feature extraction module, a feature fusion module and a classification and regression module; the feature extraction module is a new network structure eSE-dResNet combining a d-ResNet network and an eSEnet module;
step 4, training the deep neural network by using the training sample set to generate a target detection model;
step 5, inputting the image of the object to be detected into the target detection model to obtain a target detection result.
Preferably, in the step 2, the preprocessing of the target detection object image set includes: cutting and correcting the original images; if the original images in the target detection object image set are consistent in width but unequal in height, the image width is kept unchanged and the images are cut at different heights, wherein the cutting correction is realized in the following manner:
h = (w − h₁) × n + (n − 1) × h₁
where h and w represent the total height and width of the original picture, respectively, and h₁ indicates the height of the excess rectangle remaining after cutting out n pictures.
Preferably, in the step 2, the preprocessing of the target detection object image set further includes: expanding the cut and corrected data set to obtain an expanded data set; and marking the targets contained in the target detection images in the expanded data set with a marking tool.
Preferably, in step 3, the d-ResNet network is obtained by adding two cross-layer connections to an identity block in the original ResNet50 structure; the d-ResNet network performs a feature splicing operation on the input of the first 1 × 1 convolution block, the output of the first 1 × 1 convolution block and the output of the 3 × 3 convolution block, and then takes the spliced result as the input of the second 1 × 1 convolution block;
the eSEnet module is embedded between an identity block and a conv block in the d-ResNet network; and the eSEnet module replaces the original two fully-connected layers of the excitation part in SEnet with a convolution layer with a convolution kernel size of 1.
Preferably, in step 3, the feature fusion module performs feature fusion of different dimensions by using a feature pyramid structure.
Preferably, in the step 3, the feature extraction module includes stages P1 to Pi, i stages in total, and the feature fusion module includes stages Ci to Cj, i − j + 1 stages in total;
a dimensionality reduction operation is performed on the calculation result of stage Pi to obtain the calculation result of stage Ci; the intermediate result obtained by up-sampling the calculation result of stage Ci is added to the intermediate result obtained by performing a dimensionality reduction operation on the calculation result of stage Pi−1, giving the calculation result of stage Ci−1;
the intermediate result obtained by up-sampling the calculation result of stage Cm+1 is added to the intermediate result obtained by performing a dimensionality reduction operation on the calculation result of stage Pm, giving the calculation result of stage Cm; where m ∈ [j, i − 2].
Preferably, in the step 3, the classification and regression module includes: classifying sub-networks and regressing sub-networks;
obtaining a classification result through the classification sub-network, and obtaining prior frame coordinate change information through the regression sub-network; obtaining prior frame parameter information by using a k-means clustering algorithm, and obtaining predicted frame position information according to the prior frame parameter information and the coordinate change information of the prior frame; after a plurality of prediction frames are obtained, screening the prediction frames with the scores larger than a given threshold value, and obtaining the score information of the prediction frames; and carrying out non-maximum suppression processing by utilizing the position information of the prediction frame and the score information of the prediction frame to obtain positioning and classification result information.
Preferably, the classification sub-network comprises 4 convolutions of 256 dimensions and 1 convolution of N × K dimensions;
the regression subnetwork comprises 4 convolutions of 256 dimensions and 1 convolution of 4 xK dimensions;
wherein, K represents the number of prior frames possessed by the input feature layer, and N represents the number of types of the objects to be detected.
Preferably, in the step 4, the total loss function adopted by the target detection model includes a classification loss function and a regression loss function; the classification loss function adopts the Focal loss function, the regression loss function adopts the Smooth L1 loss function, and the total loss function is as follows:
Loss = FL(pt) + SmoothL1(x)
where Loss represents the total loss function, FL(pt) represents the classification loss function, and SmoothL1(x) represents the regression loss function.
Preferably, the classification loss function is as follows:
FL(pt) = −αt(1 − pt)^γ log(pt)
where αt represents a weight coefficient, (1 − pt)^γ denotes the adjustment coefficient, and pt represents the probability that a sample is predicted to be positive;
the definition of the regression loss function and its derivative form are as follows:
SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
d SmoothL1(x) / dx = x, if |x| < 1; ±1, otherwise
where x represents the difference between the predicted value and the true value.
One or more technical solutions provided by the invention at least have the following technical effects or advantages:
A target detection object image set is first obtained and preprocessed to obtain a data set, and a training sample set is constructed from the data set. A deep neural network is then constructed, comprising a feature extraction module, a feature fusion module and a classification and regression module, where the feature extraction module is a new network structure, eSE-dResNet, combining a d-ResNet network and an eSEnet module. The deep neural network is then trained with the training sample set to generate a target detection model. Finally, the image of the object to be detected is input into the target detection model to obtain the target detection result. By adopting a detection method based on the deep neural network, the invention can automatically learn target features, has strong generalization capability, and is applicable to target detection under adverse conditions such as complicated image backgrounds, high environmental noise, low contrast and uneven illumination.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Example 1:
Embodiment 1 provides a target detection method based on a deep neural network, which comprises the following steps:
step 1, obtaining an image set of a target detection object.
Step 2, preprocessing the target detection object image set to obtain a data set, and constructing a training sample set according to the data set.
Specifically, the preprocessing of the target detection object image set includes: cutting and correcting the original images; if the original images in the target detection object image set are consistent in width but unequal in height, the image width is kept unchanged and the images are cut at different heights, wherein the cutting correction is realized in the following manner:
h = (w − h₁) × n + (n − 1) × h₁
where h and w represent the total height and width of the original picture, respectively, and h₁ indicates the height of the excess rectangle remaining after cutting out n pictures.
The preprocessing of the target detection object image set further comprises: expanding the cut and corrected data set to obtain an expanded data set; and marking the targets contained in the target detection images in the expanded data set with a marking tool.
Step 3, constructing a deep neural network, wherein the deep neural network comprises a feature extraction module, a feature fusion module and a classification and regression module; the feature extraction module is a new network structure eSE-dResNet combining a d-ResNet network and an eSEnet module.
The d-ResNet network is obtained by adding two cross-layer connections to an identity block in the original ResNet50 structure. The d-ResNet network performs a feature splicing operation on the input of the first 1 × 1 convolution block, the output of the first 1 × 1 convolution block and the output of the 3 × 3 convolution block, and then takes the spliced result as the input of the second 1 × 1 convolution block. The eSEnet module is embedded between an identity block and a conv block in the d-ResNet network; the eSEnet module replaces the original two fully-connected layers of the excitation part in SEnet with a convolution layer with a convolution kernel size of 1.
The feature fusion module adopts a feature pyramid structure to perform feature fusion of different dimensions.
The feature extraction module includes stages P1 to Pi, i stages in total, and the feature fusion module includes stages Ci to Cj, i − j + 1 stages in total;
a dimensionality reduction operation is performed on the calculation result of stage Pi to obtain the calculation result of stage Ci; the intermediate result obtained by up-sampling the calculation result of stage Ci is added to the intermediate result obtained by performing a dimensionality reduction operation on the calculation result of stage Pi−1, giving the calculation result of stage Ci−1;
the intermediate result obtained by up-sampling the calculation result of stage Cm+1 is added to the intermediate result obtained by performing a dimensionality reduction operation on the calculation result of stage Pm, giving the calculation result of stage Cm; where m ∈ [j, i − 2].
The classification and regression module includes: classifying sub-networks and regressing sub-networks; obtaining a classification result through the classification sub-network, and obtaining prior frame coordinate change information through the regression sub-network; obtaining prior frame parameter information by using a k-means clustering algorithm, and obtaining predicted frame position information according to the prior frame parameter information and the prior frame coordinate change information; after a plurality of prediction frames are obtained, screening out the prediction frames with the scores larger than a given threshold value, and obtaining score information of the prediction frames; and carrying out non-maximum suppression processing by utilizing the position information of the prediction frame and the score information of the prediction frame to obtain positioning and classification result information.
The classification sub-network comprises 4 convolutions of 256 dimensions and 1 convolution of N × K dimensions; the regression sub-network comprises 4 convolutions of 256 dimensions and 1 convolution of 4 × K dimensions; where K represents the number of prior frames possessed by the input feature layer, and N represents the number of types of targets to be detected.
And 4, training the deep neural network by using the training sample set to generate a target detection model.
Specifically, the total loss function adopted by the target detection model comprises a classification loss function and a regression loss function; the classification loss function adopts the Focal loss function, the regression loss function adopts the Smooth L1 loss function, and the total loss function is as follows:
Loss = FL(pt) + SmoothL1(x)
where Loss represents the total loss function, FL(pt) represents the classification loss function, and SmoothL1(x) represents the regression loss function.
The classification loss function is as follows:
FL(pt) = −αt(1 − pt)^γ log(pt)
where αt represents a weight coefficient, (1 − pt)^γ denotes the adjustment coefficient, and pt represents the probability that a sample is predicted to be positive;
the definition of the regression loss function and its derivative form are as follows:
where x represents the difference between the predicted value and the true value.
Step 5, inputting the image of the object to be detected into the target detection model to obtain a target detection result.
The present invention will be further described below by taking the detection of train bottom parts as an example.
Example 2:
Embodiment 2 provides a target detection method based on a deep neural network and designs a new target detection model, which can quickly locate key components at the bottom of a train, realize multi-target classification of key components such as axles, hooks and piston rods, reduce manual detection steps, and improve detection efficiency. In view of the complexity of the environment at the bottom of the train, an improved d-ResNet network is designed on the basis of the residual network ResNet50, and an eSEnet module is embedded in the improved d-ResNet network to enhance the feature extraction performance. At the same time, a feature pyramid structure is adopted to perform feature fusion of different dimensions, so that the network can learn richer low-dimensional and high-dimensional features and detect vehicle bottom parts more accurately. Experimental results show that the designed network model greatly improves the detection effect on vehicle bottom parts.
The flowchart of this embodiment is shown in fig. 1, and the specific steps are as follows:
step 1: and (6) data processing.
The data set used in this embodiment is provided by a local railway office. The original data set is collected by high-definition linear array cameras erected beside the rail; each picture is 2048 pixels wide and 29956 to 39956 pixels high. Such pictures cannot be directly input into the network for training, so the original data need to be cut and corrected. The cropping manner adopted by this embodiment is as follows:
h = (w − h₁) × n + (n − 1) × h₁
where h and w represent the total height and width of the original picture, respectively, and h₁ indicates the height of the excess rectangle remaining after the n pictures are cut out. This cropping method is very simple and is suitable for pictures with a large aspect ratio.
The width of each picture is kept unchanged and the pictures are cut along their height; for convenience of calculation, the input pictures are uniformly cut to a size of 2048 × 4096. Because the overall data set is limited, the amount of cut data is insufficient and part of the cut pictures contain no targets, so the data set is expanded from the original 5123 pictures to 11747 pictures through geometric transformations such as translation, transposition, mirroring and rotation. The processed data are divided proportionally into a training set of 8037 pictures and a test set of 3710 pictures. The objects to be detected include five types: type-I axles, type-II axles, car logos, hooks and piston rods. The targets contained in each picture are marked with a marking tool.
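For illustration only, the following Python sketch shows one possible way to perform the cropping step described above; the 2048 × 4096 tile size comes from this embodiment, while the file handling, function name and discarding of the excess strip are assumptions made for the example and are not prescribed by this disclosure.

```python
# Illustrative sketch of the cropping step: cut a tall line-scan image
# (width 2048, height roughly 30000-40000 pixels) into 2048 x 4096 tiles,
# discarding the leftover strip at the bottom (an assumption for this sketch).
from PIL import Image

TILE_W, TILE_H = 2048, 4096  # uniform crop size used in this embodiment

def crop_tall_image(path, out_prefix):
    img = Image.open(path)
    w, h = img.size                      # w = 2048, h = total picture height
    n = h // TILE_H                      # number of full tiles
    h1 = h - n * TILE_H                  # height of the excess rectangle
    for k in range(n):
        tile = img.crop((0, k * TILE_H, w, (k + 1) * TILE_H))
        tile.save(f"{out_prefix}_{k:03d}.png")
    return n, h1
```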
Step 2: generating prior frames.
In order to improve the detection performance, 4 kinds of prior frames with different sizes suited to the data set are obtained with a k-means clustering algorithm before the deep neural network is trained. The sizes of the prior frames are adjusted for the different feature layers, and each feature layer divides the input picture into grids corresponding to the length and width of that feature layer.
It should be noted that the number of prior frames can be adjusted for different detection objects. Only five types of targets are detected in this embodiment and their shapes and sizes are relatively fixed, so 4 prior frames are adopted for the characteristics of this data set.
Fig. 2 shows the arrangement of prior frames in different feature layers. Of the 5 output feature layers of the feature fusion module, only the last two are shown here because the other feature layers are too large; fig. 2(a) shows the input picture, and fig. 2(b) and fig. 2(c) show the distribution of prior frames in feature layers C6 and C7, respectively. The size of the C7 feature layer is 8 × 4, so the whole picture is divided into 8 × 4 grid cells, and the 4 prior frames with different shapes obtained by clustering are then established centered on each grid cell; the other feature layers are handled in the same way.
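A rough Python sketch of obtaining the 4 prior-frame sizes by k-means is given below. The disclosure only states that a k-means clustering algorithm is used; the plain Euclidean clustering on box widths/heights shown here is one assumption for the example, and an IoU-based distance (as used in YOLO-style anchor clustering) would be an equally valid choice.

```python
# Cluster the labelled ground-truth box (width, height) pairs into k prior
# frames; returns the k cluster centres sorted by area.
import numpy as np
from sklearn.cluster import KMeans

def cluster_prior_frames(boxes_wh, k=4):
    """boxes_wh: (N, 2) array of ground-truth box (width, height) in pixels."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(boxes_wh)
    anchors = km.cluster_centers_                          # (k, 2) prior-frame sizes
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]

# example usage (hypothetical data):
# wh = np.array([[120, 80], [260, 190], [64, 200], [300, 150]])
# print(cluster_prior_frames(wh, k=4))
```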
Step 3: designing the loss function.
The model training phase needs to improve the overall performance of the model by minimizing the loss function. The loss function used in this embodiment is divided into two parts, including a classification loss function and a regression loss function, and the two are combined into a total loss metric in this embodiment.
The detection model designed in this embodiment belongs to the single-stage detection models. Prior frames are used to improve detection performance, but imbalances between positive and negative samples and between hard and easy samples can also appear, so the Focal loss used by the RetinaNet network is taken as the classification loss function of the model. Compared with the cross-entropy loss function, Focal loss introduces a weight coefficient αt on its basis; adjusting αt reduces the impact of negative samples on training. At the same time, the coefficient (1 − pt)^γ is introduced to adjust the weights between easily classified samples and hard-to-classify samples, increasing the contribution of hard samples to the loss value. The loss function is defined as follows:
FL(pt) = −αt(1 − pt)^γ log(pt)
where pt represents the probability that the sample is predicted to be positive; the experimental results are optimal when γ is 2 and αt is 0.25.
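A minimal Python (PyTorch) sketch of this classification loss is shown below, matching FL(pt) = −αt(1 − pt)^γ log(pt) with γ = 2 and αt = 0.25 as reported above. The binary, per-anchor form and the sum reduction are assumptions for the example; masking of ignored anchors is omitted.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of the same shape; targets in {0, 1}."""
    p = torch.sigmoid(logits)
    # ce is -log(pt) for the true class of each element
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = torch.where(targets == 1, p, 1 - p)          # probability of the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    return (alpha_t * (1 - pt) ** gamma * ce).sum()
```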
The regression loss function is the Smooth L1 loss function; its definition and derivative form are as follows:
SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
d SmoothL1(x) / dx = x, if |x| < 1; ±1, otherwise
where x represents the difference between the predicted value and the true value. The Smooth L1 loss can limit the gradient magnitude and combines the advantages of the L1 loss and the L2 loss, so that the loss function is differentiable at the point 0 and the network is more robust. As can be seen from the derivative, when the difference between the prediction frame and the ground-truth frame is too large the gradient does not become excessive, and when the difference is small a sufficiently small gradient is ensured.
The overall loss function is shown below:
Loss = FL(pt) + SmoothL1(x)
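The following Python sketch illustrates the regression loss and the combined total loss. How the two terms are weighted or normalised is not specified in the text, so they are simply summed here as an assumption; focal_loss refers to the sketch given above.

```python
import torch

def smooth_l1(x):
    """x: element-wise difference between predicted and true box parameters."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5).sum()

def total_loss(cls_logits, cls_targets, box_pred, box_targets):
    reg = smooth_l1(box_pred - box_targets)      # regression term
    cls = focal_loss(cls_logits, cls_targets)    # classification term (sketch above)
    return cls + reg                             # plain sum, an assumption
```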
and 4, step 4: and inputting the data set into a deep neural network for training.
The training set obtained in step 1 is input into the network in batches for training. In the training process, the data are trained for 50 epochs; because the picture size is very large and memory is limited, the number of pictures input into the deep neural network each time is 2 and the number of iterations is 200000. The network adopts the Adam optimizer, and the initial learning rate is set to 1 × 10⁻⁴.
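A minimal training-loop sketch reflecting these settings (batch size 2, Adam, initial learning rate 1e-4, 50 epochs) is given below; the dataset interface, the model returning (classification logits, box regressions), and the use of total_loss from the sketch above are placeholder assumptions, not part of the disclosure.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, device="cuda"):
    loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.to(device).train()
    for epoch in range(50):                           # 50 epochs
        for images, cls_targets, box_targets in loader:
            images = images.to(device)
            cls_logits, box_pred = model(images)      # assumed model outputs
            loss = total_loss(cls_logits, cls_targets.to(device),
                              box_pred, box_targets.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```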
The deep neural network framework is shown in fig. 3, and the whole deep neural network is divided into three modules:
(1) a feature extraction module:
In this embodiment, d-ResNet combined with eSEnet, improved on the basis of ResNet50, is used as the feature extraction module. The module has 56 layers in total and is divided into 7 stages P1 to P7 (see figure 1). In order to increase the richness and accuracy of feature extraction, this embodiment adds two cross-layer connections to the identity block in the original ResNet50 structure, as shown in fig. 4. The original identity block is composed of two convolution blocks with a size of 1 × 1 and one convolution block with a size of 3 × 3. The modified identity block performs a splicing operation (see the C connection in fig. 4) on the input of the first 1 × 1 convolution block, the output of the first 1 × 1 convolution block and the output of the 3 × 3 convolution block, and then takes the spliced result as the input of the second 1 × 1 convolution block, on which the subsequent convolution operation is performed. This realizes enhanced, repeated extraction of different features and improves the overall effect; the structure is referred to as dense-ResNet (d-ResNet for short). In order to fully consider the relationship between feature channels and enable the network to extract more valuable features, an eSEnet module is embedded in each identity block and conv block; the combination of d-ResNet and eSEnet is shown in fig. 4.
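For illustration, the following Python (PyTorch) sketch shows one possible implementation of the modified identity block described above: the input of the first 1 × 1 convolution, its output and the output of the 3 × 3 convolution are concatenated and fed to the second 1 × 1 convolution, on top of the usual ResNet identity shortcut. The channel widths (mid = channels // 4) and the BatchNorm/ReLU placement follow the usual ResNet50 bottleneck and are assumptions, not details fixed by this disclosure.

```python
import torch
import torch.nn as nn

class DenseIdentityBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4
        def conv_bn(cin, cout, k):
            return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.conv1 = conv_bn(channels, mid, 1)       # first 1x1 convolution block
        self.conv2 = conv_bn(mid, mid, 3)            # 3x3 convolution block
        # second 1x1 convolution takes the concatenated features as input
        self.conv3 = nn.Sequential(nn.Conv2d(channels + 2 * mid, channels, 1, bias=False),
                                   nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(y1)
        cat = torch.cat([x, y1, y2], dim=1)          # dense "C connection"
        out = self.conv3(cat)
        return self.relu(out + x)                    # standard identity shortcut
```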
The eSEnet module is an improvement on the basis of SEnet (Squeeze-and-Excitation network). Like SEnet, it is divided into a squeeze part and an excitation part and fuses feature channels by feature recalibration. The squeeze part adopts an adaptive global pooling operation to compress an input of size W × H with C channels to an output of size 1 × 1 with C channels, so that the output feature fuses global information. SEnet scales the feature dimensionality through two fully-connected layers: the first fully-connected layer changes the input with dimensionality C into an output with dimensionality C/r via a parameter r, and the second fully-connected layer then restores the initial dimensionality; the dimensionality reduction in this process can cause information loss. eSEnet replaces the original two fully-connected layers of the excitation part with a convolution with a kernel size of 1, which reduces the information loss to a certain extent while also reducing the amount of calculation and improving the operation efficiency of the deep neural network. The eSEnet structure is shown in fig. 5.
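A minimal sketch of such a channel-attention module is given below: adaptive global pooling for the squeeze part and a single 1 × 1 convolution, in place of SEnet's two fully-connected layers, for the excitation part. The sigmoid gate used to rescale the channels is the standard SE choice and is assumed here, since the disclosure does not name the gating activation.

```python
import torch.nn as nn

class ESEModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                      # W x H x C -> 1 x 1 x C
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)   # no C/r reduction
        self.gate = nn.Sigmoid()                                 # assumed gating

    def forward(self, x):
        w = self.gate(self.fc(self.pool(x)))                     # per-channel weights
        return x * w                                             # feature recalibration
```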
(2) A feature fusion module:
The feature fusion module fuses the calculation results of the feature extraction module, and enhances the detection effect of the deep neural network on objects of different sizes by adding feature channels with different resolutions and different semantic information. First, a dimensionality reduction operation is performed on the calculation result of stage P7 to obtain C7, changing the feature dimensionality from 8 × 4 × 2048 to 8 × 4 × 256. Then a feature up-sampling operation is performed on C7, changing its dimensionality from 8 × 4 × 256 to 16 × 8 × 256. Finally, a dimensionality reduction operation is performed on P6, and the result is added to the up-sampling result of C7 to obtain C6. In a similar way, the numbers of feature channels of P5 to P3 are reduced to 256 by dimensionality reduction operations, and the results are added to the up-sampled results of the respective previous layers to obtain C5 to C3. The feature fusion only adds cross-layer connections on the basis of the feature extraction module, so the model effect is improved without increasing the number of parameters, and the small increase in the amount of calculation is negligible.
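The top-down fusion described above can be sketched as follows in Python (PyTorch): each Pk is reduced to 256 channels with a 1 × 1 convolution and added to the 2× up-sampled result of the level above. The input channel counts and the nearest-neighbour up-sampling mode are assumptions for the example; only the 256-channel output width and the P7 size of 8 × 4 are stated in the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048, 2048, 2048), out_channels=256):
        """in_channels: channels of P3..P7 (assumed), deepest stage last."""
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, feats):                      # feats = [P3, P4, P5, P6, P7]
        outs = [None] * len(feats)
        outs[-1] = self.lateral[-1](feats[-1])     # C7: dimensionality reduction of P7
        for i in range(len(feats) - 2, -1, -1):    # C6 ... C3, top-down
            up = F.interpolate(outs[i + 1], scale_factor=2, mode="nearest")
            outs[i] = self.lateral[i](feats[i]) + up
        return outs                                # [C3, C4, C5, C6, C7]
```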
(3) Classification and regression:
The deeper the deep neural network is, the more serious the loss of spatial information in the extracted features, which can affect the effect of feature fusion; if the depth of the network is insufficient, the semantic information of the extracted features is not rich enough and the detection effect on large targets is poor. Experiments show that the detection effect of the 5 types of feature layers adopted in this embodiment is optimal.
After feature fusion, 5 feature layers with different sizes and the same dimensionality are obtained, and the detection results are obtained by processing these 5 feature layers with a classification sub-network and a regression sub-network. The classification sub-network comprises 4 convolutions of 256 dimensions and 1 convolution of N × K dimensions, where K is the number of prior frames possessed by the input feature layer and N is the number of types of targets to be detected; the classification result is output after the features pass through the N × K convolution. The regression sub-network comprises 4 convolutions of 256 dimensions and 1 convolution of 4 × K dimensions; its output is the change of the coordinates of each prior frame, and the prior frames are combined with these changes to obtain the position information of the prediction frames. After processing by the classification and regression networks, a number of prediction frames are obtained; the prediction frames with scores larger than a given threshold are screened out, and NMS (non-maximum suppression) processing is performed using the position information and scores of these frames to obtain the final detection result.
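For illustration, a Python sketch of the classification and regression sub-networks is given below: four 256-channel convolutions followed by a final N × K (classification) or 4 × K (regression) convolution per fused feature layer. The 3 × 3 kernel size and ReLU activations are assumptions, as the disclosure only specifies the channel counts; prior-frame decoding and score thresholding are only outlined in a comment.

```python
import torch.nn as nn
from torchvision.ops import nms   # used for the final non-maximum suppression step

def _head(out_ch):
    layers = []
    for _ in range(4):                                       # 4 convolutions of 256 dims
        layers += [nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(256, out_ch, 3, padding=1))      # final N*K or 4*K convolution
    return nn.Sequential(*layers)

class DetectionHead(nn.Module):
    def __init__(self, num_classes, num_priors=4):
        super().__init__()
        self.cls_subnet = _head(num_classes * num_priors)    # N x K output channels
        self.reg_subnet = _head(4 * num_priors)              # 4 x K output channels

    def forward(self, feat):                                 # feat: one fused feature layer
        return self.cls_subnet(feat), self.reg_subnet(feat)

# After decoding the prior frames with the regression output and keeping boxes
# whose scores exceed the threshold, torchvision.ops.nms(boxes, scores, iou_thr)
# on (M, 4) xyxy boxes and (M,) scores yields the final detections.
```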
The target detection method based on the deep neural network provided by the embodiment of the invention at least comprises the following technical effects:
(1) In the traditional target detection algorithm, the manually designed features are not robust to diverse changes; the detection algorithm based on the deep neural network can automatically learn the features of the target, has strong generalization capability, and can be applied to more scenes.
(2) The invention improves on the ResNet network and designs a d-ResNet network combined with an eSEnet module as the feature extraction module. Compared with other feature extraction modules, this module introduces dense connections in the residual module, realizes enhanced repeated extraction of different features, has better feature extraction performance and incurs little extra calculation.
(3) An attention mechanism is introduced into the feature extraction module, and fusion among feature channels adopts a feature weight calibration method: the weights of the feature channels are learned and assigned by the network itself, so that the weights of useful feature channels are increased and the weights of weakly correlated feature channels are reduced.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting them. Although the present invention has been described in detail with reference to examples, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, all of which should be covered by the claims of the present invention.