
CN112364721A - Road surface foreign matter detection method - Google Patents

Road surface foreign matter detection method

Info

Publication number
CN112364721A
Authority
CN
China
Prior art keywords
network
module
feature
target
dense connection
Prior art date
Legal status
Withdrawn
Application number
CN202011147589.XA
Other languages
Chinese (zh)
Inventor
赵巧芝
岳庆冬
Current Assignee
Xian Cresun Innovation Technology Co Ltd
Original Assignee
Xian Cresun Innovation Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xian Cresun Innovation Technology Co Ltd
Priority to CN202011147589.XA
Publication of CN112364721A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a road surface foreign matter detection method, which comprises the following steps: acquiring a target pavement image to be detected; inputting a target road surface image into an improved YOLOv3 network obtained by pre-training, and performing feature extraction by using a backbone network in a dense connection form to obtain x feature maps with different scales, wherein x is a natural number greater than or equal to 4; carrying out feature fusion on the feature graphs of x different scales in a top-down and dense connection mode by using an FPN network to obtain a prediction result corresponding to each scale; and processing the prediction result through a classification network and a non-maximum suppression module to obtain the recognition result of each target in the target pavement image, wherein the recognition result comprises the category and the position of the target. The invention can improve the detection precision; foreign matters with different scales can be detected, and the types of the foreign matters can be accurately detected; the detection speed is improved, and real-time detection is realized.

Description

Road surface foreign matter detection method
Technical Field
The invention belongs to the field of target detection, and particularly relates to a road surface foreign matter detection method.
Background
Airport runway Foreign Object Debris (FOD) refers to any foreign object that does not belong to the runway working area but appears on the runway for various reasons. FOD includes engine fasteners (nuts, screws, washers, fuses, etc.), machine tools, flight-related articles (nails, personal documents, pens, pencils, etc.), wildlife, foliage, stones and sand, pavement material, wood blocks, plastic or polyethylene material, paper products, ice from the movement area, and so on. FOD can easily be drawn into an engine and cause engine failure; debris can also accumulate in mechanical assemblies and interfere with the normal operation of the landing gear, flaps and other components. Such foreign objects not only cause huge direct losses, but also indirect losses such as flight delays, aborted take-offs and runway closures, and the indirect loss is at least 4 times the direct loss.
At present, FOD monitoring at most airports around the world is still performed manually. This approach is inefficient and occupies precious runway time, and its ability to detect tiny objects is poor: small objects may go undetected or be classified inaccurately.
Therefore, there is an urgent need for a road surface foreign matter detection method that achieves high-precision, high-reliability detection of tiny targets.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a road surface foreign object detection method, apparatus, electronic device, and storage medium. The technical problem to be solved by the invention is realized by the following technical scheme:
in a first aspect, the present invention provides a method for detecting a foreign object on a road surface, comprising:
acquiring a target pavement image to be detected;
inputting the target road surface image into an improved YOLOv3 network obtained by pre-training, and performing feature extraction by using a backbone network in a dense connection form to obtain x feature maps with different scales; x is a natural number of 4 or more;
performing feature fusion on the x feature maps with different scales in a top-down and dense connection mode by using an FPN network to obtain a prediction result corresponding to each scale;
processing all prediction results through a classification network and a non-maximum suppression module to obtain an identification result of each target in the target pavement image, wherein the identification result comprises the category and the position of the target;
the improved YOLOv3 network comprises a backbone network, an FPN network, a classification network and a non-maximum suppression module which are connected in sequence; the improved YOLOv3 network is formed by replacing a residual error module in a main network with a dense connection module, increasing a feature extraction scale and optimizing a feature fusion mode of an FPN network on the basis of a YOLOv3 network; the improved YOLOv3 network is trained according to a sample image and the category and the position of each target in the sample image.
In one embodiment of the present invention, the backbone network in a dense connection form includes:
dense connection modules and transition modules which are connected in series at intervals; the number of the dense connection modules is y; the dense connection module comprises a convolution network module and a dense connection unit group which are connected in series; the convolution network module comprises a convolution layer, a BN layer and a Leaky relu layer which are connected in series; the dense connection unit group comprises m dense connection units; each dense connection unit comprises a plurality of convolution network modules which are connected in a dense connection mode, and feature graphs output by the convolution network modules are fused in a cascading mode; wherein y is a natural number of 4 or more, and m is a natural number of 1 or more.
In an embodiment of the present invention, the extracting features by using a backbone network to obtain x feature maps with different scales includes:
and performing feature extraction on the target road surface image by utilizing y dense connection modules which are connected in series to obtain x feature maps which are output by the x dense connection modules in the reverse direction along the input direction and have sequentially increased scales.
In one embodiment of the invention, the transition module comprises the convolution network module and a max-pooling layer; the convolution network module and the max-pooling layer share the same input, and the feature map output by the convolution network module and the feature map output by the max-pooling layer are fused in a cascade manner.
In an embodiment of the present invention, the transition module contains two or three convolution network modules, which are connected in series.
In one embodiment of the invention, the FPN network comprises x prediction branches Y1 to Yx of successively increasing scale, wherein the scales of the prediction branches Y1 to Yx correspond one-to-one to the scales of the x feature maps;
the method for performing feature fusion from top to bottom in a dense connection mode by using the FPN network on the feature maps with the x different scales comprises the following steps:
for predicted branch YiObtaining the characteristic diagram with corresponding scale from the x characteristic diagrams, performing convolution processing, and comparing the feature diagram after convolution processing with the prediction branch Yi-1~Y1Performing cascade fusion on the feature maps subjected to the upsampling treatment respectively; wherein branch Y is predictedi-jHas an upsampling multiple of 2j(ii) a i is 2, 3, …, x; j is a natural number smaller than i.
In an embodiment of the present invention, before training the modified YOLOv3 network, the method further includes:
determining the number to be clustered for the anchor box size in the sample image;
acquiring a plurality of sample images with marked target frame sizes;
based on a plurality of sample images marked with the size of the target frame, acquiring a clustering result of the size of the anchor box in the sample images by using a K-Means clustering method;
writing the clustering result into a configuration file of the improved YOLOv3 network.
In a second aspect, the present invention provides a road surface foreign matter detection apparatus, comprising:
the acquisition module is used for acquiring a target pavement image to be detected;
the feature extraction module is used for inputting the target road surface image into an improved YOLOv3 network obtained by pre-training, and extracting features by using a backbone network in a dense connection form to obtain x feature maps with different scales; x is a natural number of 4 or more;
the characteristic fusion module is used for carrying out characteristic fusion on the x characteristic graphs with different scales in a top-down and dense connection mode by using an FPN network to obtain a prediction result corresponding to each scale;
the classification and NMS module is used for processing all prediction results through a classification network and a non-maximum suppression module to obtain the recognition result of each target in the target pavement image, and the recognition result comprises the category and the position of the target;
the improved YOLOv3 network comprises a backbone network, an FPN network, a classification network and a non-maximum suppression module which are connected in sequence; the improved YOLOv3 network is formed by replacing a residual error module in a main network with a dense connection module, increasing a feature extraction scale and optimizing a feature fusion mode of an FPN network on the basis of a YOLOv3 network; the improved YOLOv3 network is trained according to a sample image and the category and the position of each target in the sample image.
In a third aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing any of the above method steps when executing a program stored in the memory.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method steps of any of the above.
The invention has the beneficial effects that:
according to the method, the residual modules in the backbone network of the YOLOv3 network are replaced by the dense connection modules, and the feature fusion mode is changed from parallel to serial, so that when feature extraction is carried out on the backbone network, the early feature graph can be directly used as the input of each layer behind, the obtained feature graph has more information content, and the feature transmission is enhanced, so that the detection precision can be improved when airport pavement foreign object detection is carried out.
The invention transmits the feature maps from shallow to deep, extracts feature maps of at least four scales, enables a network to detect foreign matters of different scales, especially tiny foreign matters, by increasing feature extraction scales of fine granularity, and simultaneously realizes accurate classification of the foreign matters.
The invention carries out feature fusion from top to bottom in a dense connection mode by utilizing the FPN network, directly carries out upsampling on deep features by different multiples so as to ensure that all feature graphs transmitted have the same size, fuses the feature graphs and shallow feature graphs in a cascading mode, can utilize more original information, has participation of high-dimensional semantic information in the shallow network and is beneficial to improving the detection precision; meanwhile, more specific characteristics can be obtained by directly receiving the characteristics of a shallower network, the loss of the characteristics can be effectively reduced, the parameter quantity needing to be calculated can be reduced, the detection speed is improved, and real-time detection is realized.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a schematic flow chart of a road surface foreign matter detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a prior art YOLOv3 network;
fig. 3 is a schematic structural diagram of an improved YOLOv3 network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a transition module according to an embodiment of the present invention;
FIG. 5-1 is a graph comparing the mAP curves of YOLOv3 and Dense-YOLO-1 of the present invention; FIG. 5-2 is a comparison of the loss curves of YOLOv3 and Dense-YOLO-1 of the present invention;
FIG. 6-1 is a graph comparing the mAP curves of Dense-YOLO-1 and MultiScale-YOLO-1 according to an embodiment of the present invention; FIG. 6-2 is a comparison of the loss curves of Dense-YOLO-1 and MultiScale-YOLO-1 in accordance with an embodiment of the present invention;
FIG. 7-1 is a graph comparing the mAP curves of Dense-YOLO-1 and Dense-YOLO-2 according to an embodiment of the present invention;
FIG. 7-2 is a comparison graph of the loss curves of Dense-YOLO-1 and Dense-YOLO-2 in accordance with the present invention;
FIG. 8-1 is a graph comparing the mAP curves of Dense-YOLO-1 and MultiScale-YOLO-2 according to an embodiment of the present invention; FIG. 8-2 is a comparison of the loss curves of Dense-YOLO-1 and MultiScale-YOLO-2 according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a road surface foreign matter detection device according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
The embodiment of the invention provides a road surface foreign matter detection method and device, electronic equipment and a storage medium.
It should be noted that the main implementation body of the method for detecting foreign objects on a road surface provided by the embodiment of the present invention may be a device for detecting foreign objects on a road surface, and the device for detecting foreign objects on a road surface may be operated in an electronic device. The electronic device may be disposed in a monitoring device such as a radar or an unmanned aerial vehicle, but is not limited thereto.
In a first aspect, an embodiment of the present invention provides a method for detecting a foreign object on a pavement. Next, the method for detecting road surface foreign matter will be described first.
As shown in fig. 1, a method for detecting a foreign object on a road surface according to an embodiment of the present invention may include the following steps:
s1, acquiring a target road surface image to be detected;
the target pavement image is an image shot by image acquisition equipment aiming at a to-be-detected area; the image acquisition device may be a high resolution photoelectric monitoring device that may be located in a lighthouse, sidelight, or on the roof of a moving vehicle at an airport.
The image acquisition device may include a high resolution camera, a video camera, and the like.
In the embodiment of the present invention, the size of the road surface image to be detected is required to be 416 × 416 × 3.
Thus, at this step, in one embodiment, a road surface image of 416 × 416 × 3 size may be directly obtained; in another embodiment, an image of any size may be obtained, and the obtained image is subjected to a certain size scaling process to obtain a road surface image of 416 × 416 × 3 size.
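For illustration only (not part of the claimed method), the following is a minimal sketch of how an image of arbitrary size could be scaled to the required 416 × 416 × 3 input, assuming OpenCV and NumPy are available; the function name and the normalization step are illustrative assumptions.

```python
import cv2
import numpy as np

def prepare_input(image_path: str) -> np.ndarray:
    """Read an image and resize it to the 416 x 416 x 3 input size of the detector."""
    img = cv2.imread(image_path)            # H x W x 3, BGR
    if img is None:
        raise FileNotFoundError(image_path)
    img = cv2.resize(img, (416, 416), interpolation=cv2.INTER_LINEAR)
    return img.astype(np.float32) / 255.0   # normalize pixel values to [0, 1]
```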
In the two embodiments, the obtained image may be subjected to image enhancement operations such as cropping, stitching, smoothing, filtering, edge filling, and the like, so as to enhance features of interest in the image and expand the generalization capability of the data set.
S2, inputting the target road surface image into an improved YOLOv3 network obtained by pre-training, and extracting features by using a backbone network in a dense connection form to obtain x feature maps with different scales; x is a natural number of 4 or more;
to facilitate understanding of the network structure of the improved YOLOv3 network proposed in the embodiment of the present invention, first, a network structure of a YOLOv3 network in the prior art is introduced, please refer to fig. 2, and fig. 2 is a schematic structural diagram of a YOLOv3 network in the prior art. In fig. 2, the part inside the dashed box is the YOLOv3 network. Wherein the part in the dotted line frame is a backbone (backbone) network of the YOLOv3 network, namely a darknet-53 network; the backbone network of the YOLOv3 network is formed by connecting CBL modules and 5 resn modules in series. The CBL module is a Convolutional network module, and includes a conv layer (convolutive layer, convolutive layer for short), a BN (Batch Normalization) layer and an leakage relu layer corresponding to an activation function leakage relu, which are connected in series, and the CBL represents conv + BN + leakage relu. The resn module is a residual error module, n represents a natural number, and specifically, as shown in fig. 2, res1, res2, res8, res8, and res4 are sequentially arranged along the input direction; the resn module comprises a zero padding (zero padding) layer, a CBL module and a Residual error unit group which are connected in series, the Residual error unit group is represented by Res unit n, the Residual error unit group comprises n Residual error units, each Residual error unit comprises a plurality of CBL modules which are connected in a Residual error Network (ResNet) connection mode, and the feature fusion mode adopts a parallel mode, namely an add mode.
The rest of the network outside the backbone network is the Feature Pyramid Network (FPN), which is divided into three prediction branches Y1 to Y3; the scales of the prediction branches Y1 to Y3 correspond one-to-one to the scales of the feature maps output by the three residual modules res4, res8 and res8 counted from the output end (i.e., along the reverse of the input direction). The prediction results of the branches are denoted Y1, Y2 and Y3, and the scales of Y1, Y2 and Y3 increase in that order.
Each prediction branch of the FPN network includes a convolutional network module group, specifically 5 convolutional network modules, i.e., CBL × 5 in fig. 2. In addition, the US (upsampling) module is an upsampling module; concat indicates that feature fusion is performed in a cascade manner (concat is short for concatenate).
For the specific structure of each main module in the YOLOv3 network, please refer to the schematic diagram below the dashed box in fig. 2.
In the embodiment of the invention, the improved YOLOv3 network comprises a backbone network in a dense connection form, an FPN network, a classification network and a non-maximum suppression module; the improved YOLOv3 network is formed by replacing a residual error module in a main network with a dense connection module, increasing a feature extraction scale and optimizing a feature fusion mode of an FPN network on the basis of a YOLOv3 network; the improved Yolov3 network is trained according to the sample pavement image and the position and the type of the target corresponding to the sample pavement image. The network training process is described later.
To facilitate understanding of the present invention, the structure of the modified YOLOv3 network will be described first, and the backbone network part will be described first.
Fig. 3 shows a structure of an improved YOLOv3 network according to an embodiment of the present invention, where fig. 3 is a schematic structural diagram of an improved YOLOv3 network according to an embodiment of the present invention; in fig. 3, it can be seen that the backbone network has changed, see the part inside the dotted box in fig. 3.
Compared with the backbone network of the YOLOv3 network, the improvement idea of the backbone network of the improved YOLOv3 network provided by the embodiment of the present invention is, on one hand, to propose a specific dense connection module, drawing on the connection scheme of the densely connected convolutional network DenseNet, and to use it to replace the residual modules (resn modules) in the backbone network of the YOLOv3 network; that is, the backbone network of the improved YOLOv3 network is a backbone network in a dense connection form. It is known that ResNets combine features by summation before passing them on to the next layer, i.e., feature fusion is performed in a parallel manner. The dense connection approach, in contrast, connects all layers (with matching feature-map sizes) directly to each other in order to ensure that information flows to the maximum extent between layers in the network. Specifically, each layer takes all feature maps of the preceding layers as its input, and its own feature map serves as input for all subsequent layers, i.e., feature fusion is performed by concatenation (cascade). Therefore, compared with the YOLOv3 network, which uses residual modules, the improved YOLOv3 network obtains feature maps with more information by using dense connection modules instead, which strengthens feature propagation and improves detection precision when detecting pavement images. At the same time, because redundant feature maps do not need to be learned again, the number of parameters and the amount of computation can be greatly reduced, and the problem of vanishing gradients can be alleviated.
On the other hand, the embodiment of the invention passes feature maps from shallow to deep layers and extracts feature maps of at least four scales, so that the network can detect objects of different scales; by adding a fine-grained feature extraction scale, the detection precision for tiny targets can be improved during subsequent target detection.
For example, referring to fig. 3, the backbone network in the form of dense connection may include:
dense connection modules and transition modules which are connected in series at intervals; the densely connected modules are denoted denm in fig. 3. The number of the dense connection modules is y; the dense connection module comprises a convolution network module and a dense connection unit group which are connected in series; the convolution network module comprises a convolution layer, a BN layer and a Leaky relu layer which are connected in series; the dense connecting unit group comprises m dense connecting units; each dense connection unit comprises a plurality of convolution network modules connected in a dense connection mode, and a characteristic diagram output by the convolution network modules is fused in a cascading mode; wherein y is a natural number of 4 or more, and m is a natural number of 1 or more.
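For illustration only, the following PyTorch sketch shows one way a CBL module and a dense connection unit with cascade (concat) fusion could be realized; the channel counts, growth rate and number of layers are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU: the basic convolutional network module."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DenseUnit(nn.Module):
    """Several CBL modules whose outputs are fused by channel concatenation."""
    def __init__(self, in_ch, growth, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(CBL(ch, growth, k=3))
            ch += growth                    # each layer sees all earlier feature maps
        self.out_channels = ch

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            y = layer(torch.cat(features, dim=1))   # dense (cascade) fusion of inputs
            features.append(y)
        return torch.cat(features, dim=1)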
As an example, the number of dense connection modules in fig. 3 is 5, and the precision of an improved YOLOv3 network built with 5 dense connection modules is higher than that of one built with 4.
The convolutional network module is still denoted CBL, as before. The dense connection unit group is denoted den unit × m, meaning that it comprises m dense connection units, where m may be 2. Each dense connection unit is denoted den unit; it comprises several convolutional network modules connected in a dense connection manner, and the feature maps output by the convolutional network modules are fused in a cascade manner, i.e., concat. Concat means tensor splicing: unlike the add operation in a residual module, concat expands the channel dimension of the tensor, whereas add simply sums element-wise without changing the tensor dimensions. Therefore, when the backbone network of the improved YOLOv3 network performs feature extraction, using the dense connection module changes the feature fusion mode from addition to concatenation, so that early feature maps can be used directly as the input of every later layer, feature propagation is strengthened, and, by reusing the feature map parameters of the shallow network, the number of parameters and the amount of computation are reduced.
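The difference between add and concat fusion can be seen in a short PyTorch example; the shapes used here are illustrative only.

```python
import torch

a = torch.randn(1, 64, 52, 52)
b = torch.randn(1, 64, 52, 52)

added = a + b                       # "add": element-wise sum, shape unchanged
concat = torch.cat([a, b], dim=1)   # "concat": tensor splicing along the channel axis

print(added.shape)    # torch.Size([1, 64, 52, 52])
print(concat.shape)   # torch.Size([1, 128, 52, 52])
```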
In the embodiment of the invention, the backbone network in the dense connection form extracts feature maps of at least 4 scales for the feature fusion of the subsequent prediction branches, so the number y of dense connection modules is greater than or equal to 4, and the feature maps output by the backbone network are correspondingly fused into the prediction branches. It can be seen that, compared with the YOLOv3 network, the improved YOLOv3 network adds at least one finer-grained feature extraction scale in the backbone network. Referring to fig. 3, compared with the YOLOv3 network, the feature map output by the fourth dense connection module counted from the output end is additionally extracted for subsequent feature fusion. Therefore, the backbone network in the dense connection form outputs corresponding feature maps from the four dense connection modules counted from the output end, and the scales of these four feature maps increase successively. Specifically, the scales of the feature maps are 13 × 13 × 72, 26 × 26 × 72, 52 × 52 × 72, and 104 × 104 × 72, respectively.
Of course, in an alternative embodiment, five feature extraction scales may be set, that is, the feature map output by the fifth densely connected module with the extraction direction reversed along the input direction is added for subsequent feature fusion, and so on.
Specifically, for the step S2, obtaining x feature maps with different scales includes:
and obtaining x characteristic graphs which are output by the x dense connection modules along the input reverse direction and have sequentially increased scales.
Referring to fig. 3, a feature map output by the first densely-connected module to the fourth densely-connected module in the reverse direction of the input is obtained, and the sizes of the four feature maps are sequentially increased.
In the embodiment of the present invention, for the structure of the transition module:
in an optional first embodiment, the transition module is a convolutional network module. I.e. using the CBL module as a transition module. Then, when a backbone network of the improved YOLOv3 network is built, the residual module is only required to be replaced by the dense connection module, and then the dense connection module and the original CBL module are connected in series to obtain the target. Therefore, the network building process is quicker, and the obtained network structure is simpler. However, such a transition module only uses convolution layers for transition, that is, the dimension of the feature map is reduced by directly increasing the step size, and this only takes care of the features in the local region, but cannot combine the information of the whole feature map, so that the information in the feature map is lost more.
In a second optional embodiment, the transition module comprises a convolutional network module and a max-pooling layer; the convolutional network module and the max-pooling layer share the same input, and the feature map output by the convolutional network module and the feature map output by the max-pooling layer are fused in a cascade manner. Referring to fig. 4, which is a schematic structural diagram of a transition module according to an embodiment of the present invention, the transition module is represented as a tran module and the MP layer is a max-pooling layer (Maxpool, abbreviated MP). Further, the stride of the MP layer may be chosen as 2. In this embodiment, the introduced MP layer can reduce the dimension of the feature map with a larger receptive field; it uses few parameters, so it does not add much computation, while it weakens the possibility of overfitting and improves the generalization ability of the network model. Combined with the original CBL module, the feature map can be regarded as being dimension-reduced from different receptive fields, so that more information is retained.
For the second embodiment, optionally, the transition module contains two or three convolutional network modules, which are connected in series. Compared with using a single convolutional network module, using two or three convolutional network modules in series increases the complexity of the model and allows features to be extracted more fully.
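A possible realization of such a transition module, reusing the CBL module sketched earlier, is shown below; the downsampling stride and channel counts are illustrative assumptions, not values specified in the patent.

```python
import torch
import torch.nn as nn

class Transition(nn.Module):
    """tran module: a CBL branch and a stride-2 max-pooling branch share the same
    input; their outputs are fused by concatenation along the channel axis."""
    def __init__(self, in_ch, out_ch, n_cbl=2):
        super().__init__()
        cbls = [CBL(in_ch, out_ch, k=3, s=2)]                     # first CBL downsamples
        cbls += [CBL(out_ch, out_ch, k=3, s=1) for _ in range(n_cbl - 1)]
        self.cbl_branch = nn.Sequential(*cbls)
        self.mp_branch = nn.MaxPool2d(kernel_size=2, stride=2)    # MP layer, stride 2

    def forward(self, x):
        # Both branches halve the spatial size (for even inputs), so concat is valid.
        return torch.cat([self.cbl_branch(x), self.mp_branch(x)], dim=1)
```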
S3, performing feature fusion in a top-down and densely connected manner on the x feature maps of different scales by using the FPN (Feature Pyramid Network) to obtain a prediction result corresponding to each scale;
referring to fig. 3, the rest except the trunk network, the classification network and the non-maximum suppression module is an FPN (Feature Pyramid Networks) network including x prediction branches Y with sequentially increasing scales1~Yx(ii) a Wherein the prediction branch Y1~YxThe scales of the x feature maps correspond to the scales of the x feature maps one by one; see fig. 2, i.e. prediction branch Y1~YxRespectively corresponding to the scales of the feature maps output by the 4 dense connection modules along the input reverse direction.
The feature fusion of the top-down and dense connection mode is carried out on the feature graphs of the x different scales by using an FPN network, and the method comprises the following steps:
for predicted branch YiObtaining the characteristic diagram with corresponding scale from the x characteristic diagrams, performing convolution processing, and comparing the feature diagram after convolution processing with the prediction branch Yi-1~Y1Respectively up-sampledCarrying out cascade fusion on the processed characteristic graphs; wherein branch Y is predictedi-jHas an upsampling multiple of 2j(ii) a i is 2, 3, …, x; j is a natural number smaller than i.
This can be understood with reference to fig. 3, taking i = 3, i.e. prediction branch Y3, as an example. The feature maps involved in its cascade fusion come from three sources. First, the feature map of the corresponding scale is obtained from the 4 feature maps and convolved: the feature map output by the third dense connection module counted from the output end passes through a CBL module; this can also be understood as upsampling by a factor of 1, and its size is 52 × 52 × 72. Second, from prediction branch Y2 (i.e. Yi-1 = Y2): the feature map output by the second dense connection module counted from the output end (size 26 × 26 × 72) passes through the CBL module of prediction branch Y2 and is then upsampled by 2^1 = 2 times (size 52 × 52 × 72). Third, from prediction branch Y1 (i.e. Yi-2 = Y1): the feature map output by the first dense connection module counted from the output end (size 13 × 13 × 72) passes through the CBL module of prediction branch Y1 and is then upsampled by 2^2 = 4 times (size 52 × 52 × 72). As those skilled in the art will understand, after the feature maps of the three different scales output by the backbone network are upsampled by these different multiples, the three feature maps to be cascaded and fused all have the same size of 52 × 52 × 72. Prediction branch Y3 can then continue with convolution and other processing after the cascade fusion to obtain the prediction result Y3, whose size is 52 × 52 × 72.
For the feature fusion process of the remaining prediction branches, refer to prediction branch Y3; it is not repeated here. Prediction branch Y1 carries out the subsequent prediction directly after obtaining the feature map output by the first dense connection module counted from the output end, and does not receive feature maps from the other prediction branches for fusion.
This embodiment adopts a densely connected fusion method: deep features are directly upsampled by different multiples so that all transferred feature maps have the same size, and these maps are fused with the shallow feature map in a cascade manner; features are then extracted again from the fused result to remove internal noise and retain the main information before prediction. In this way more of the original information can be used, and high-dimensional semantic information participates in the shallow network, which is beneficial to improving the detection precision. Meanwhile, directly receiving features from shallower layers preserves more concrete features, effectively reduces feature loss, and reduces the number of parameters to be computed, which increases detection speed and enables real-time detection.
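The following PyTorch sketch illustrates the described dense top-down fusion for prediction branch Y3; it omits the per-branch CBL convolutions for brevity and assumes 72-channel feature maps as in fig. 3, so it is an illustration of the fusion idea rather than the exact network.

```python
import torch
import torch.nn.functional as F

def dense_fuse(feature_i, shallower_feats):
    """Fuse the map of branch Y_i with those of branches Y_{i-1} ... Y_1:
    branch Y_{i-j} is upsampled by 2**j so every map has the same spatial size,
    then all maps are concatenated (cascade fusion)."""
    fused = [feature_i]
    for j, feat in enumerate(shallower_feats, start=1):        # Y_{i-1}, Y_{i-2}, ...
        fused.append(F.interpolate(feat, scale_factor=2 ** j, mode="nearest"))
    return torch.cat(fused, dim=1)

# Branch Y3 example from the text: its own 52x52 map, Y2 (26x26, x2) and Y1 (13x13, x4).
y3 = torch.randn(1, 72, 52, 52)
y2 = torch.randn(1, 72, 26, 26)
y1 = torch.randn(1, 72, 13, 13)
out = dense_fuse(y3, [y2, y1])     # shape: (1, 216, 52, 52)
```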
This step mainly describes the feature fusion method; after feature fusion, each prediction branch mainly uses a number of convolution operations for prediction. For how each prediction result is obtained, refer to the related prior art; it is not described here. The prediction results of the prediction branches are Y1 to Y4, whose sizes are marked below their names; refer to the network structure diagram of fig. 3 for details.
S4, processing all prediction results through a classification network and a non-maximum suppression module to obtain the recognition result of each target in the target road surface image, wherein the recognition result comprises the category and the position of the target;
for each target, the detection result is in the form of a vector, including the position of the prediction box, the confidence of the target in the prediction box, and the category of the target in the prediction box. The position of the prediction frame is used for representing the position of the target in the target road surface image; specifically, the position of each prediction frame is represented by four values, bx, by, bw and bh, bx and by are used for representing the position of the center point of the prediction frame, and bw and bh are used for representing the width and height of the prediction frame.
The categories of targets include fasteners (nuts, screws, washers, fuses, etc.), machine tools, flight-related articles (nails, personal documents, pens, pencils, etc.), wildlife, leaves, stones and sand, pavement material, wood blocks, plastic or polyethylene material, paper products, ice from the movement area, and so on.
Optionally, the classification network may be a SoftMax classifier, or may perform classification by using logistic regression, so as to implement classification of the detection result.
The non-maximum suppression module is configured to perform NMS (non-maximum suppression) processing, which removes detection boxes of relatively low confidence from among multiple detection boxes that repeatedly frame the same object.
For the processing procedure of the classification network and the non-maximum suppression module, please refer to the related prior art, which is not described herein.
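For reference, a minimal NumPy sketch of the standard NMS procedure is given below; the IoU threshold is an illustrative value, not one specified by the patent.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Keep the highest-scoring box, drop boxes overlapping it by more than
    iou_thresh, and repeat for the remaining boxes.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        # Intersection of box i with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]
    return keep
```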
In fig. 3, feature maps of four scales, namely 13 × 13 × 72, 26 × 26 × 72, 52 × 52 × 72, and 104 × 104 × 72, are output by the 4 prediction branches. The smallest, 13 × 13 × 72, feature map has the largest receptive field and is suitable for detecting large targets; the medium 26 × 26 × 72 feature map has a medium receptive field and is suitable for detecting medium-sized targets; the larger 52 × 52 × 72 feature map has a smaller receptive field and is suitable for detecting small targets; and the largest, 104 × 104 × 72, feature map has the smallest receptive field and is suitable for detecting even smaller, tiny targets. The embodiment of the invention divides the image more finely, and the prediction results are better targeted at objects of small size.
Hereinafter, the pre-training process and the training process of the improved YOLOv3 network will be briefly described.
The process comprises the following steps. (I) Building a specific network structure: the improvement can be made on the basis of the YOLOv3 network by replacing the residual modules in the backbone network with dense connection modules, adding a feature extraction scale, optimizing the feature fusion mode of the FPN network, and improving the transition module, to obtain the network structure shown in fig. 3 as the built network; wherein m is 4.
(II) Obtaining a plurality of sample pavement images and the positions and categories of the targets corresponding to each sample pavement image. In this process, the position and category of the targets corresponding to each sample road image are known; they may be determined, for example, by manual recognition or by other image recognition tools. Afterwards, the sample pavement images need to be labelled, either manually or, which is equally reasonable, with other artificial intelligence methods. The position of each target in a sample pavement image is labelled in the form of a target frame containing the target; the target frames are real and accurate, and each target frame is labelled with coordinate information to represent the position of the target in the image.
(III) determining the size of an anchor box in the sample pavement image; may include the steps of:
a) determining the quantity to be clustered aiming at the size of the anchor boxes in the sample pavement image;
in the field of target detection, an anchor box (anchor box) is a plurality of boxes with different sizes obtained by statistics or clustering from real boxes (ground route) in a training set; the anchor box actually restrains the predicted object range and adds the prior experience of the size, thereby realizing the aim of multi-scale learning. In the embodiment of the present invention, since a finer-grained feature extraction scale is desired to be added, the sizes of the labeled target frames (i.e., real frames) in the sample road surface image need to be clustered by using a clustering method, so as to obtain a suitable anchor box size suitable for the scene of the embodiment of the present invention.
Wherein the determining the number to be clustered for the size of the anchor box in the sample pavement image comprises:
determining the number of types of the anchor box size corresponding to each scale; and taking the product of the number of the types of the anchor box sizes corresponding to each scale and the x as the quantity to be clustered of the anchor box sizes in the sample road surface image.
Specifically, in the implementation of the present invention, the number of types of anchor box size corresponding to each scale is chosen as 3; taking 4 scales as an example, the number of anchor box sizes to be clustered in the sample road surface images is 3 × 4 = 12.
b) Acquiring a plurality of sample images with marked target frame sizes;
this step is actually to obtain the size of each target frame in the sample image.
c) Based on a plurality of sample images marked with the size of the target frame, acquiring a clustering result of the size of the anchor box in the sample images by using a K-Means clustering method;
specifically, the size of each target frame can be clustered by using a K-Means clustering method to obtain a clustering result of the size of the anchor box; no further details regarding the clustering process are provided herein.
The distance between different anchor boxes is defined as the Euclidean distance of their widths and heights:

d_{1,2} = \sqrt{(w_1 - w_2)^2 + (h_1 - h_2)^2}

where d_{1,2} denotes the Euclidean distance between the two anchor boxes, w_1 and w_2 denote their widths, and h_1 and h_2 denote their heights.
For the number of clusters to be clustered being 12, the anchor box size of each predicted branch can be obtained.
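A minimal sketch of this clustering step is shown below, assuming scikit-learn's K-Means (which uses the Euclidean distance defined above) and k = 12; sorting the cluster centers by area is an illustrative convention for assigning anchors to branches, not a requirement stated in the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_sizes(box_wh, k=12):
    """Cluster labelled target-frame (w, h) pairs into k anchor box sizes
    using K-Means with the Euclidean distance on width and height."""
    box_wh = np.asarray(box_wh, dtype=np.float32)   # shape (N, 2)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_wh)
    centers = km.cluster_centers_
    # Sort by area so the smallest anchors can go to the finest-grained branch.
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```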
d) Writing the clustering result into a configuration file of the improved YOLOv3 network.
As will be understood by those skilled in the art, the clustering result is written into the configuration file of each predicted branch of the improved YOLOv3 network according to the anchor box size corresponding to different predicted branches, and then network training may be performed.
Network training requires data in VOC or COCO format, while the labelled data are stored in text documents, so a Python script is needed to convert the annotation format of the data set.
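As an example of such a conversion script (the VOC tag names are standard; the YOLO-style output line format shown is an assumption about the target layout, not specified by the patent):

```python
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, class_names):
    """Convert one VOC-style annotation file to YOLO text lines:
    'class_id x_center y_center width height', coordinates normalized to [0, 1]."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        x1, y1 = float(box.find("xmin").text), float(box.find("ymin").text)
        x2, y2 = float(box.find("xmax").text), float(box.find("ymax").text)
        lines.append(f"{cls} {(x1 + x2) / 2 / w:.6f} {(y1 + y2) / 2 / h:.6f} "
                     f"{(x2 - x1) / w:.6f} {(y2 - y1) / h:.6f}")
    return lines
```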
(IV) training the network shown in FIG. 3 by using each sample pavement image and the category and position of each target in each sample pavement image, comprising the following steps:
1) Taking the position and category of the targets corresponding to each sample road surface image as the ground-truth values of that image, and passing each sample road surface image and its ground truth through the network shown in fig. 3 to obtain the training result of each sample road surface image.
2) Comparing the training result of each sample road surface image with the ground truth corresponding to that image to obtain the output result corresponding to the sample road surface image.
3) Calculating the loss value of the network according to the output result corresponding to each sample road surface image.
4) Adjusting the parameters of the network according to the loss value and repeating steps 1)-3) until the loss value of the network satisfies a convergence condition, i.e., reaches its minimum, which means that the training result of each sample pavement image is consistent with its ground truth; the training of the network is then complete.
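A minimal PyTorch training-loop sketch corresponding to steps 1)-4) is shown below; the optimizer, learning rate and epoch count are illustrative assumptions, and the loss function would be the YOLOv3-style loss mentioned in the experiments below.

```python
import torch

def train(model, loader, loss_fn, epochs=100, lr=1e-3, device="cuda"):
    """Sketch of the training loop: forward pass, compare with ground truth via
    the loss, back-propagate, and update the network parameters."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        running_loss = 0.0
        for images, targets in loader:           # targets: class + box per object
            images = images.to(device)
            predictions = model(images)           # step 1): training result
            loss = loss_fn(predictions, targets)  # steps 2)-3): compare with truth
            optimizer.zero_grad()
            loss.backward()                       # step 4): adjust parameters
            optimizer.step()
            running_loss += loss.item()
        print(f"epoch {epoch}: loss = {running_loss / len(loader):.4f}")
```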
According to the embodiment of the invention, the residual error module in the backbone network of the YOLOv3 network is replaced by the dense connection module, and the feature fusion mode is changed from parallel to serial, so that the early feature map can be directly used as the input of each layer behind when the backbone network carries out feature extraction, the obtained information content of the feature map is more, and the transfer of features is strengthened, therefore, the detection precision can be improved when airport pavement detection is carried out.
The embodiment of the invention transmits the feature maps from shallow to deep, extracts the feature maps with at least four scales, enables the network to detect foreign matters with different scales, especially tiny foreign matters, by increasing the feature extraction scale with fine granularity, and simultaneously realizes the accurate classification of the foreign matters.
The embodiment of the invention carries out feature fusion in a top-down and dense connection mode by utilizing the FPN network, directly carries out upsampling on deep features by different multiples so as to ensure that all feature graphs transmitted have the same size, fuses the feature graphs and shallow feature graphs in a cascading mode, can utilize more original information, has participation of high-dimensional semantic information in the shallow network, and is favorable for improving the detection precision; meanwhile, more specific characteristics can be obtained by directly receiving the characteristics of a shallower network, the loss of the characteristics can be effectively reduced, the parameter quantity needing to be calculated can be reduced, the detection speed is improved, and real-time detection is realized.
In the embodiment of the invention, the network obtained by replacing the residual modules in the backbone network of the YOLOv3 network with dense connection modules and improving the transition modules is named Dense-YOLO-1; its structure is shown in fig. 3 and is not described again here. The Dense-YOLO-1 and YOLOv3 networks were tested. The mAP (mean Average Precision) of the model was chosen as the evaluation metric; mAP lies between 0 and 1, and the larger the mAP, the better the model precision. The convergence of the models was also observed through their loss curves, where the loss function is still constructed according to the loss function of YOLOv3. The size and detection speed of the networks were considered as well, so the model file sizes of the different networks and their detection times for a road image on a Tesla V100 server and a Jetson TX2 edge device were recorded. Referring to figs. 5-1 and 5-2, fig. 5-1 compares the mAP curves of YOLOv3 and Dense-YOLO-1, and fig. 5-2 compares their loss curves. As can be seen from the figures, the precision of Dense-YOLO-1 is improved by about 4 percent, while the difference between the loss curves of the two models is extremely slight, so a semilogarithmic scale is used to magnify it; the loss of Dense-YOLO-1 is slightly lower than that of YOLOv3. Therefore, as shown by the precision and loss curves, replacing the residual structure in YOLOv3 with dense connections and improving the transition modules between the dense connection modules can considerably improve network performance.
On the basis of Dense-YOLO-1, one multi-scale improvement idea is to add a finer-grained target detection scale to YOLOv3 so that the network can detect smaller objects. The embodiment of the invention specifically adds the 104 × 104 scale and sets the corresponding anchor box sizes, and the resulting network is named MultiScale-YOLO-1; its structure can be understood with reference to figs. 2 and 3 and is not described in detail. The mAP and loss curves of the Dense-YOLO-1 and MultiScale-YOLO-1 networks are shown in figs. 6-1 and 6-2. It can be seen that the multi-scale network improves on the densely connected network, but the change is not large, only about 7%, and the difference in the loss curves is still not obvious. This may be because the number of small objects in the data set is not large, the need for fine-grained recognition is not strong, and the gain in network precision from adding a finer-grained detection scale is therefore not significant. A data set with more detailed small-target annotations could therefore be sought, so that the network can be trained at a finer granularity and recognize more tiny objects during recognition. Of course, if the requirements are high and no suitable data set exists, a data set can be annotated by oneself, given sufficient time and effort.
On the basis of Dense-YOLO-1, another multi-scale improvement idea is to improve the feature fusion method itself, so that the detection process fuses semantic information of more dimensions and the target recognition precision is improved. Therefore, the feature fusion mode of the FPN network is improved to the top-down densely connected fusion described above, and the resulting network is named Dense-YOLO-2; its structure is not shown. The mAP and loss curves of the Dense-YOLO-1 and Dense-YOLO-2 networks are shown in figs. 7-1 and 7-2. With the fusion mode changed, the advantage of multiple scales becomes more obvious in the multi-scale feature fusion network with the top-down densely connected fusion method, probably because the densely connected fusion method retains more high-dimensional abstract semantic information than lateral connections, allowing the model to distinguish objects more clearly. The precision of the network with the changed fusion mode is improved by 18.2 percent compared with the original network, and its loss curve is also slightly lower than before. As shown in the figures, improving the fusion mode clearly improves network precision.
The two multi-scale improvement methods are then combined on the basis of Dense-YOLO-1: the multi-scale feature fusion model enlarges the field of view of the network and improves the localization accuracy for objects of different scales, while the top-down dense connection method fuses high-dimensional semantic information more fully and strengthens the network's ability to classify different objects. The final network is named MultiScale-YOLO-2; its structure is not shown. Its precision and loss compared with Dense-YOLO-1 are shown in figs. 8-1 and 8-2. It can be seen that, relative to Dense-YOLO-1, the precision of this densely fused network structure with a finer-grained field of view is improved by 24.5% and the loss curve is further reduced, which indicates that such an improvement is effective.
In a second aspect, corresponding to the above method embodiment, an embodiment of the present invention provides a road surface foreign object detection apparatus. Referring to fig. 9, the apparatus includes:
an obtaining module 901, configured to obtain a target road surface image to be detected;
a feature extraction module 902, configured to input the target road surface image into an improved YOLOv3 network obtained through pre-training, and perform feature extraction by using a backbone network in a dense connection form to obtain x feature maps with different scales; x is a natural number of 4 or more;
a feature fusion module 903, configured to perform feature fusion on the x feature maps with different scales in a top-down dense connection manner by using an FPN network, so as to obtain a prediction result corresponding to each scale;
and the classification and NMS module 904 is configured to process all prediction results through a classification network and a non-maximum suppression module to obtain an identification result of each target in the target road surface image, where the identification result includes a category and a position of the target.
The improved YOLOv3 network comprises a backbone network, an FPN network, a classification network and a non-maximum suppression module which are connected in sequence; the improved YOLOv3 network is formed by replacing a residual error module in a main network with a dense connection module, increasing a feature extraction scale and optimizing a feature fusion mode of an FPN network on the basis of a YOLOv3 network; the improved YOLOv3 network is trained according to a sample image and the category and the position of each target in the sample image.
For details, please refer to the contents of the method for detecting a foreign object on a road surface in the first aspect, which is not described herein again.
According to the embodiment of the invention, the residual error module in the backbone network of the YOLOv3 network is replaced by the dense connection module, and the feature fusion mode is changed from parallel to serial, so that the early feature map can be directly used as the input of each layer behind when the backbone network carries out feature extraction, the obtained information content of the feature map is more, and the transfer of features is strengthened, therefore, the detection precision can be improved when airport pavement detection is carried out.
The embodiment of the invention transmits the feature maps from shallow to deep, extracts the feature maps with at least four scales, enables the network to detect foreign matters with different scales, especially tiny foreign matters, by increasing the feature extraction scale with fine granularity, and simultaneously realizes the accurate classification of the foreign matters.
The embodiment of the invention carries out feature fusion in a top-down and dense connection mode by utilizing the FPN network, directly carries out upsampling on deep features by different multiples so as to ensure that all feature graphs transmitted have the same size, fuses the feature graphs and shallow feature graphs in a cascading mode, can utilize more original information, has participation of high-dimensional semantic information in the shallow network, and is favorable for improving the detection precision; meanwhile, more specific characteristics can be obtained by directly receiving the characteristics of a shallower network, the loss of the characteristics can be effectively reduced, the parameter quantity needing to be calculated can be reduced, the detection speed is improved, and real-time detection is realized.
In a third aspect, corresponding to the foregoing method embodiments, an embodiment of the present invention further provides an electronic device, as shown in fig. 10, including a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, where the processor 1001, the communication interface 1002 and the memory 1003 communicate with one another through the communication bus 1004:
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the steps of any one of the above road surface foreign matter detection methods when executing the program stored in the memory 1003.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The above electronic device can achieve the following:
By replacing the residual modules in the backbone network of the YOLOv3 network with dense connection modules and changing the feature fusion mode from parallel to serial, earlier feature maps can be used directly as inputs to each subsequent layer during feature extraction, so each feature map carries more information and feature propagation is strengthened; detection accuracy is therefore improved when detecting an airport pavement.
In addition, feature maps are propagated from shallow to deep layers and extracted at no fewer than four scales; by adding a finer-grained feature extraction scale, the network can detect foreign matter of different sizes, especially tiny foreign matter, and classify it accurately at the same time.
Moreover, the FPN network performs feature fusion in a top-down, densely connected manner: deep features are directly upsampled by different multiples so that all feature maps passed down share the same size, and they are fused with the shallow feature maps by concatenation, so more of the original information can be used, high-level semantic information participates in the shallow branches, and detection accuracy is improved. Meanwhile, receiving features directly from shallower layers yields more concrete features, effectively reduces feature loss, and cuts the number of parameters to be computed, which raises detection speed and enables real-time detection.
In a fourth aspect, corresponding to the road surface foreign matter detection method provided in the foregoing embodiments, the embodiments of the present invention further provide a computer-readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements the steps of any one of the foregoing road surface foreign matter detection methods.
The above-mentioned computer-readable storage medium stores an application program that, when executed, performs the road surface foreign matter detection method provided by the embodiments of the present invention, and can thus achieve the following:
By replacing the residual modules in the backbone network of the YOLOv3 network with dense connection modules and changing the feature fusion mode from parallel to serial, earlier feature maps can be used directly as inputs to each subsequent layer during feature extraction, so each feature map carries more information and feature propagation is strengthened; detection accuracy is therefore improved when detecting an airport pavement.
In addition, feature maps are propagated from shallow to deep layers and extracted at no fewer than four scales; by adding a finer-grained feature extraction scale, the network can detect foreign matter of different sizes, especially tiny foreign matter, and classify it accurately at the same time.
Moreover, the FPN network performs feature fusion in a top-down, densely connected manner: deep features are directly upsampled by different multiples so that all feature maps passed down share the same size, and they are fused with the shallow feature maps by concatenation, so more of the original information can be used, high-level semantic information participates in the shallow branches, and detection accuracy is improved. Meanwhile, receiving features directly from shallower layers yields more concrete features, effectively reduces feature loss, and cuts the number of parameters to be computed, which raises detection speed and enables real-time detection.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
For the embodiments of the electronic device and the computer-readable storage medium, since the contents of the related methods are substantially similar to those of the foregoing embodiments of the methods, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the embodiments of the methods.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for detecting foreign matter on a road surface, comprising:
acquiring a target pavement image to be detected;
inputting the target road surface image into an improved YOLOv3 network obtained by pre-training, and performing feature extraction by using a backbone network in a dense connection form to obtain x feature maps with different scales; x is a natural number of 4 or more;
performing feature fusion on the x feature maps of different scales in a top-down, dense connection manner by using an FPN network to obtain a prediction result corresponding to each scale;
processing all prediction results through a classification network and a non-maximum suppression module to obtain an identification result of each target in the target pavement image, wherein the identification result comprises the category and the position of the target;
the improved YOLOv3 network comprises a backbone network in a dense connection form, an FPN network, a classification network and a non-maximum suppression module which are connected in sequence; the improved YOLOv3 network is formed by replacing the residual modules in the backbone network with dense connection modules, increasing the number of feature extraction scales and optimizing the feature fusion mode of the FPN network on the basis of the YOLOv3 network; the improved YOLOv3 network is trained according to a sample pavement image and the category and position of each target in the sample pavement image.
2. The method of claim 1, wherein the backbone network in a dense connection form comprises:
dense connection modules and transition modules which are alternately connected in series, the number of the dense connection modules being y; the dense connection module comprises a convolution network module and a dense connection unit group which are connected in series; the convolution network module comprises a convolution layer, a BN layer and a Leaky ReLU layer which are connected in series; the dense connection unit group comprises m dense connection units; each dense connection unit comprises a plurality of convolution network modules connected in a dense connection manner, and the feature maps output by the convolution network modules are fused in a cascade manner; wherein y is a natural number of 4 or more, and m is a natural number of 1 or more.
3. The method for detecting foreign matter on a road surface according to claim 2, wherein the performing feature extraction by using a backbone network in a dense connection form to obtain x feature maps with different scales comprises:
performing feature extraction on the target road surface image by using the y dense connection modules connected in series, to obtain x feature maps of successively increasing scale output by the last x dense connection modules counted backward along the input direction.
4. The method of claim 2, wherein the transition module comprises the convolution network module and a max-pooling layer; the convolution network module and the max-pooling layer share the same input, and the feature map output by the convolution network module and the feature map output by the max-pooling layer are fused in a cascade manner.
5. The method for detecting foreign matter on a road surface according to claim 4, wherein the transition module comprises two or three of the convolution network modules, and the convolution network modules are connected in series.
6. The method of claim 3, wherein the FPN network comprises x prediction branches Y1~Yx of successively increasing scale; wherein the prediction branches Y1~Yx are in one-to-one correspondence with the scales of the x feature maps;
the performing feature fusion on the x feature maps of different scales in a top-down, dense connection manner by using the FPN network comprises:
for a prediction branch Yi, obtaining the feature map of the corresponding scale from the x feature maps and performing convolution processing on it, and cascade-fusing the convolved feature map with the feature maps of the prediction branches Yi-1~Y1 after each of them has been respectively upsampled; wherein the upsampling multiple for prediction branch Yi-j is 2^j; i = 2, 3, …, x; j is a natural number smaller than i.
7. The method of claim 1, wherein training the improved YOLOv3 network further comprises:
determining the number of anchor box sizes to be clustered for the sample images;
acquiring a plurality of sample images with marked target frame sizes;
based on a plurality of sample images marked with the size of the target frame, acquiring a clustering result of the size of the anchor box in the sample images by using a K-Means clustering method;
writing the clustering result into a configuration file of the improved YOLOv3 network.
8. A road surface foreign matter detection device, characterized by comprising:
the acquisition module is used for acquiring a target pavement image to be detected;
the feature extraction module is used for inputting the target road surface image into an improved YOLOv3 network obtained by pre-training, and extracting features by using a backbone network in a dense connection form to obtain x feature maps with different scales; x is a natural number of 4 or more;
the characteristic fusion module is used for carrying out characteristic fusion on the x characteristic graphs with different scales in a top-down and dense connection mode by using an FPN network to obtain a prediction result corresponding to each scale;
the classification and NMS module is used for processing all prediction results through a classification network and a non-maximum suppression module to obtain the recognition result of each target in the target pavement image, and the recognition result comprises the category and the position of the target;
the improved YOLOv3 network comprises a backbone network, an FPN network, a classification network and a non-maximum suppression module which are connected in sequence; the improved YOLOv3 network is formed by replacing the residual modules in the backbone network with dense connection modules, increasing the number of feature extraction scales and optimizing the feature fusion mode of the FPN network on the basis of the YOLOv3 network; the improved YOLOv3 network is trained according to a sample image and the category and position of each target in the sample image.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
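As a hedged illustration of the anchor-box clustering in claim 7, the following sketch clusters labelled target-frame sizes (width, height) with K-Means; the 1 - IoU distance, the iteration scheme and the function names are assumptions, since the claim only specifies K-Means clustering and writing the result into the configuration file.

```python
import numpy as np


def iou_wh(boxes, anchors):
    # IoU between (w, h) pairs, treating boxes and anchors as sharing a corner.
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union


def kmeans_anchors(box_sizes, k, iters=100, seed=0):
    # box_sizes: (N, 2) array of labelled target-frame widths and heights.
    rng = np.random.default_rng(seed)
    anchors = box_sizes[rng.choice(len(box_sizes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(box_sizes, anchors), axis=1)  # min 1 - IoU
        new = np.array([
            box_sizes[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
            for i in range(k)
        ])
        if np.allclose(new, anchors):
            break
        anchors = new
    # Sorted by area; these sizes would then be written into the network config.
    return anchors[np.argsort(anchors.prod(axis=1))]
```

A typical use would be three anchors per prediction scale (so k = 12 when x = 4), with the largest clusters assigned to the coarsest scale, although the claim itself leaves that split unspecified.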
CN202011147589.XA 2020-10-23 2020-10-23 Road surface foreign matter detection method Withdrawn CN112364721A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011147589.XA CN112364721A (en) 2020-10-23 2020-10-23 Road surface foreign matter detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011147589.XA CN112364721A (en) 2020-10-23 2020-10-23 Road surface foreign matter detection method

Publications (1)

Publication Number Publication Date
CN112364721A true CN112364721A (en) 2021-02-12

Family

ID=74511942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011147589.XA Withdrawn CN112364721A (en) 2020-10-23 2020-10-23 Road surface foreign matter detection method

Country Status (1)

Country Link
CN (1) CN112364721A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950837A (en) * 2021-03-03 2021-06-11 中国工商银行股份有限公司 Banknote damage condition identification method and device based on deep learning
CN112950837B (en) * 2021-03-03 2023-06-16 中国工商银行股份有限公司 Banknote breakage condition identification method and device based on deep learning
CN112949500A (en) * 2021-03-04 2021-06-11 北京联合大学 Improved YOLOv3 lane line detection method based on spatial feature coding
CN113269161A (en) * 2021-07-16 2021-08-17 四川九通智路科技有限公司 Traffic signboard detection method based on deep learning
CN113887330A (en) * 2021-09-10 2022-01-04 国网吉林省电力有限公司 Target detection system based on remote sensing image
CN114067126A (en) * 2021-11-19 2022-02-18 长春理工大学 Infrared image target detection method
CN115035305A (en) * 2022-06-20 2022-09-09 寒武纪行歌(南京)科技有限公司 Pavement quality detection method and device and related products
CN116468729A (en) * 2023-06-20 2023-07-21 南昌江铃华翔汽车零部件有限公司 Automobile chassis foreign matter detection method, system and computer
CN116468729B (en) * 2023-06-20 2023-09-12 南昌江铃华翔汽车零部件有限公司 Automobile chassis foreign matter detection method, system and computer

Similar Documents

Publication Publication Date Title
CN112380921A (en) Road detection method based on Internet of vehicles
CN112364721A (en) Road surface foreign matter detection method
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN110321910B (en) Point cloud-oriented feature extraction method, device and equipment
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
CN109063719B (en) Image classification method combining structure similarity and class information
CN112381763A (en) Surface defect detection method
CN115830471B (en) Multi-scale feature fusion and alignment domain self-adaptive cloud detection method
CN112395953A (en) Road surface foreign matter detection system
Mussina et al. Multi-modal data fusion using deep neural network for condition monitoring of high voltage insulator
CN112288700A (en) Rail defect detection method
Huang et al. Deep Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey
CN112232240A (en) Road sprinkled object detection and identification method based on optimized intersection-to-parallel ratio function
CN112417973A (en) Unmanned system based on car networking
CN112464717A (en) Remote sensing image target detection method, system, electronic equipment and storage medium
Li et al. An aerial image segmentation approach based on enhanced multi-scale convolutional neural network
Bagwari et al. A comprehensive review on segmentation techniques for satellite images
CN112288702A (en) Road image detection method based on Internet of vehicles
Wang Remote sensing image semantic segmentation algorithm based on improved ENet network
CN112733686A (en) Target object identification method and device used in image of cloud federation
CN112288701A (en) Intelligent traffic image detection method
CN112380918A (en) Road vehicle state identification method and device, electronic equipment and storage medium
Cong et al. CAN: Contextual aggregating network for semantic segmentation
Ye et al. M2f2-net: Multi-modal feature fusion for unstructured off-road freespace detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20210212