CN116665036B - RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5
- Publication number: CN116665036B
- Application number: CN202310211247.7A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/10 — Scenes; scene-specific elements; terrestrial scenes
- G06T7/10 — Image analysis; segmentation; edge detection
- G06V10/40 — Extraction of image or video features
- G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06T2207/10048 — Image acquisition modality; infrared image
- Y02T10/40 — Engine management systems
Abstract
The invention relates to an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, and designs a single-mode auxiliary supervision RGB-infrared multi-source YOLOv5 target detection model, MSAS-YOLOv5. The model uses YOLOv5, which balances speed and accuracy, as its base method and efficiently extracts multi-level features of the RGB visible light and thermal infrared images. A single-mode auxiliary supervision method is designed, combining a semantic segmentation auxiliary task with single-mode detection auxiliary tasks. Independent target detection prediction branches are set up for the visible light and thermal infrared modes, and independent visible light and thermal infrared mode labels are used to supervise the predictions of the two modes respectively. In terms of accuracy, MSAS-YOLOv5 achieves a log-average miss rate of 5.89% on the KAIST dataset, missing fewer detection targets. Its FPS reaches 24.39, faster than mainstream multi-source target detection methods.
Description
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5.
Background
Object detection is an important area of computer vision. Image-based object detection identifies the categories and positions of objects present in an image through image processing and computer vision techniques. Collaborative target detection with multi-source RGB-infrared images alleviates the poor recognition performance of RGB-only target detection under bad weather and poor illumination. RGB-infrared multi-source image target detection has important applications in fields such as autonomous driving, video surveillance and security.
In recent years, deep learning techniques, typified by convolutional neural networks, have become the mainstream approach in target detection. Deep-learning-based object detection methods can be broadly divided into single-stage detection, represented by the YOLO series, and two-stage detection, represented by the Faster R-CNN series. Deep-learning-based RGB-infrared multi-source target detection has also been widely studied and outperforms multi-source target detection based on traditional methods. For example, Zhang et al. ("Weakly aligned cross-modal learning for multispectral pedestrian detection", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 5127-5137) designed a multi-source target detection framework based on Faster R-CNN and provided independent visible light and thermal infrared annotations for the multi-source target detection dataset, because of misalignment in the dataset: objects in the visible light and thermal infrared images have different positions, and this misalignment affects the supervision of a multi-source object detection network. Kim et al. ("Uncertainty-guided cross-modal learning for robust multispectral pedestrian detection", IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(3): 1510-1523) proposed a Faster R-CNN-based uncertainty-aware multi-source pedestrian detection framework that learns more discriminative multi-source modality features.
However, current deep-learning RGB-infrared multi-source target detection methods still have problems: 1. most current multi-source target detection methods are based on the two-stage Faster R-CNN detector, so their inference speed is low; 2. existing work rarely makes explicit use of independent visible light and thermal infrared annotations to provide additional auxiliary information for object detection, even though such mode-independent labels can provide more accurate feature-learning supervision for the visible light and thermal infrared backbone networks. It is therefore necessary to design a single-mode auxiliary supervision multi-source YOLOv5 target detection method that builds on a fast single-stage detector and is supervised with combined auxiliary information.
Disclosure of Invention
Technical problem to be solved
Aiming at the problems that existing multi-source target detection methods are slow and do not fully exploit auxiliary information to improve detection supervision, the invention provides an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5.
Technical proposal
An RGB-infrared multi-source image target detection model based on single-mode auxiliary supervision and YOLOv5 is characterized by comprising four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part; the backbone network part extracts convolution feature maps from the visible light and thermal infrared images respectively; the bottleneck part combines the feature maps of the visible light and thermal infrared modes along top-down and bottom-up paths to obtain features at each level for the fusion mode, the visible light mode and the thermal infrared mode; the segmentation part predicts semantic segmentation from the visible light and thermal infrared mode feature maps through convolution layers and uses it as an auxiliary supervision task; the detection part performs target detection on the features of each level of each mode and applies decision-level fusion to the prediction results of the fusion, visible light and thermal infrared modes; the predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode prediction auxiliary information.
An RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 is characterized by comprising the following steps:
Step 1: the visible light and thermal infrared images I_rgb and I_ir are input into the backbone network part for feature extraction; I_rgb and I_ir sequentially pass through visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir, where i ∈ {1,2,3,4,5}, yielding feature pairs F_i^rgb and F_i^ir, where F_i^rgb is the feature extracted from the visible light image I_rgb by visible light convolution module C_i^rgb and F_i^ir is the feature extracted from the thermal infrared image I_ir by thermal infrared convolution module C_i^ir; after C_5^rgb and C_5^ir, the features F_5^rgb and F_5^ir are channel-concatenated and fed into a spatial pyramid pooling module to obtain the processed multi-source feature F_spp;
Step 2: the feature maps F_3^rgb and F_3^ir output by the visible light convolution module C_3^rgb and the thermal infrared convolution module C_3^ir in step 1 are channel-concatenated, and a semantic segmentation prediction result is then obtained through a segmentation convolution module;
Step 3: the backbone features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir from step 1 and the feature F_spp from the spatial pyramid pooling module are fed into the bottleneck part; to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which respectively extract three groups of multi-scale features for the fusion mode, the thermal infrared mode and the visible light mode; the fusion mode bottleneck takes F_3^rgb, F_3^ir, F_4^rgb, F_4^ir and F_spp as input, channel-concatenates F_3^rgb with F_3^ir and F_4^rgb with F_4^ir as multi-source features, and outputs the fused multi-scale features P_3^f, P_4^f and P_5^f; the thermal infrared mode bottleneck takes F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck takes F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v;
Step 4: a detection module performs target detection on the three groups of multi-scale features obtained from the three mode bottlenecks in step 3 to obtain object predictions for each mode at each scale; for the predictions of the fusion, thermal infrared and visible light modes, decision-level fusion is used to fuse the predictions at the same scale; finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input multi-source image.
The invention further adopts the technical scheme that: the visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir in step 1 are each formed by a basic convolution module and a cross-stage local convolution module connected in series; the cross-stage local convolution module splits the input feature map along the channel dimension into two feature maps F_p1 and F_p2; F_p1 passes through only one basic convolution module with a residual connection, F_p2 passes through several basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module.
The invention further adopts the technical scheme that: the parameters of the visible light convolution modules C_i^rgb and the thermal infrared convolution modules C_i^ir are independent.
The invention further adopts the technical scheme that: the segmentation convolution module in step 2 refers to a structure formed by connecting in series a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer.
The invention further adopts the technical scheme that: the three mode bottleneck structures in step 3 each process the multi-scale feature maps using a path aggregation network (PAN) structure that includes top-down and bottom-up paths.
The invention further adopts the technical scheme that: the basic convolution modules in the backbone network part and the bottleneck part are composed of a convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit.
The invention further adopts the technical scheme that: the detection module in step 4 is composed of a 1×1 convolution layer; target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent; to reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
The invention further adopts the technical scheme that: in the step 4, the decision-level fusion of the detection results of the three modes refers to weighted average of the detection results of the fusion, visible light and thermal infrared modes, and the weight coefficients of the three modes are respectively 0.5, 0.25 and 0.25.
Advantageous effects
The invention provides an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, and designs a single-mode auxiliary supervision RGB-infrared multi-source YOLOv5 target detection model, MSAS-YOLOv5. The model uses YOLOv5, which balances speed and accuracy, as its base method and efficiently extracts multi-level features of the RGB visible light and thermal infrared images. A single-mode auxiliary supervision method is designed, combining a semantic segmentation auxiliary task with single-mode detection auxiliary tasks. Independent target detection prediction branches are set up for the visible light and thermal infrared modes, and independent visible light and thermal infrared mode labels are used to supervise the predictions of the two modes respectively. In terms of accuracy, MSAS-YOLOv5 achieves a log-average miss rate of 5.89% on the KAIST dataset, missing fewer detection targets. Its FPS reaches 24.39, faster than mainstream multi-source target detection methods.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a network configuration diagram of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing comparison of target detection results of the method according to the embodiment of the present invention and other prior art methods.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
An RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 builds a new single-mode auxiliary supervision multi-source YOLOv5 target detection model for RGB-infrared multi-source image target detection, using YOLOv5, a single-stage target detection method that balances speed and accuracy, and introducing single-mode auxiliary supervision. The detection model comprises four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part. The backbone network part extracts convolution feature maps from the visible light and thermal infrared images respectively. The bottleneck part combines the feature maps of the visible light and thermal infrared modes along top-down and bottom-up paths to obtain features at each level for the fusion mode, the visible light mode and the thermal infrared mode. The segmentation part predicts semantic segmentation from the visible light and thermal infrared mode feature maps through convolution layers and uses it as an auxiliary supervision task. The detection part performs target detection on the features of each level of each mode and applies decision-level fusion to the prediction results of the fusion, visible light and thermal infrared modes. The predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode prediction auxiliary information.
The method specifically comprises the following steps:
Step 1: the visible and thermal infrared images I rgb and I ir are input into the backbone network part for feature extraction. The visible light and thermal infrared images I rgb and I ir sequentially pass through a visible light convolution module And thermal infrared convolution module(Where i.epsilon. {1,2,3,4,5 }) to obtain a plurality of feature pairsAndWherein the method comprises the steps ofVisible light convolution module for visible light image I rgb The features that are extracted later are used to determine,Thermal infrared convolution module for thermal infrared image I ir Post-extracted features. At the position ofAndAfter that, the characteristics areAndChannel splicing is carried out, and the channel splicing is sent into a space pyramid pooling module to obtain the processed multi-source characteristics
Step 2: convoluting the visible light in the step 1And thermal infrared convolution moduleOutput characteristic diagramAndPerforming channel splicing, and then obtaining a semantic segmentation prediction result through a segmentation convolution module;
Step 3: characterization of backbone network part in step 1 Features from a spatial pyramid pooling moduleInto the bottleneck section. In order to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures of a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, and three-to-multi-scale characteristics of the fusion mode, the thermal infrared mode and the visible light mode are respectively extracted. Wherein the modality bottleneck structure is fused AndFor input, willAndChannel stitching is carried out as multi-source feature processing, and multi-scale features are outputThermal infrared mode bottleneck structureAndOutputting thermal infrared multi-scale features as inputsVisible light mode bottleneck structureAndFor inputting and outputting visible light multi-scale featuresIn order to reduce the parameter quantity and improve the efficiency, the bottleneck part structural parameters of the visible light and the thermal infrared modes are shared;
step 4: and (3) performing target detection on the three-to-multi-scale features obtained in the bottleneck part of the three modes in the step (3) by using a detection module to obtain object prediction under each mode and each scale. For prediction of three modes of fusion, thermal infrared and visible light, decision-level fusion is used for fusion of predictions under the same scale. And finally, carrying out non-maximum suppression post-processing on the three-scale detection results obtained through fusion to obtain target detection prediction output of the input multi-source image.
Optionally, the visible light convolution module in step 1And thermal infrared convolution moduleAre each formed by a basic convolution module and a cross-stage local convolution module which are connected in series. The cross-stage local convolution module splits an input feature map into two feature maps F p1 and F p2,Fp1 according to channels, residual connection is carried out only through one basic convolution module, F p2 is carried out through a plurality of basic convolution modules, and finally the two feature maps are subjected to channel splicing and pass through the final basic convolution module, so that output of the cross-stage local convolution module is obtained.
Optionally, the visible light convolution module in step 1And thermal infrared convolution moduleThe parameters are independent.
Optionally, the split convolution module in step 2 refers to a structure formed by concatenating 3×3 convolution layers, batch normalization, and correction linear units, and 3×3 convolution layers.
Optionally, the semantic segmentation predictions predicted in step 2 will be used for assisted supervision at training time. And in the training stage of the network model, using the target detection boundary box label as a mask, and supervising the semantic segmentation prediction result obtained in the step 2.
Optionally, the three modality bottleneck section in step 3 processes the multi-scale feature map using a path aggregation network PAN (Path Aggregation Network) structure that contains top-down and bottom-up paths.
Optionally, the basic convolution modules in the backbone network part and the bottleneck part are composed of a convolution layer, a batch normalization layer and a Sigmoid weighted linear unit.
Optionally, the detection module in step 4 is composed of a 1×1 convolution layer. Target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent. To reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
Optionally, the decision-level fusion of the detection results of the three modes in the step 4 refers to weighted averaging of the detection results of the fusion, visible light and thermal infrared modes, and the weight coefficients of the three modes are respectively 0.5, 0.25 and 0.25.
Optionally, the detection results of the three modes in step 4 are each supervised during training by matching the loss function with the corresponding training labels: the labels shared by the visible light and thermal infrared modes are taken as ground truth to supervise the fusion mode detection result; to realize single-mode detection auxiliary supervision, labels unique to the visible light mode supervise the visible light mode detection result, and labels unique to the thermal infrared mode supervise the thermal infrared mode detection result.
Specific examples:
Addressing the problems that existing multi-source target detection methods are insufficient in speed and make little comprehensive use of effective supervision information, the invention designs an RGB-infrared multi-source image target detection model based on single-mode auxiliary supervision and YOLOv5, as shown in FIG. 1. It includes a backbone network part, a bottleneck part, a segmentation part and a detection part. The specific method comprises the following steps:
S1: The visible light and thermal infrared images I_rgb and I_ir are input into the backbone network part for feature extraction to obtain feature pairs F_i^rgb and F_i^ir, where F_i^rgb is the feature extracted from the visible light image I_rgb by visible light convolution module C_i^rgb and F_i^ir is the feature extracted from the thermal infrared image I_ir by thermal infrared convolution module C_i^ir. After C_5^rgb and C_5^ir, the features F_5^rgb and F_5^ir are channel-concatenated and fed into the spatial pyramid pooling module to obtain the processed multi-source feature F_spp.
S2: As shown in FIG. 1, the feature maps F_3^rgb and F_3^ir output by the visible light convolution module C_3^rgb and the thermal infrared convolution module C_3^ir are channel-concatenated, and a semantic segmentation prediction result is then obtained through the segmentation convolution module.
S3: The backbone features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir from step 1 and the feature F_spp from the spatial pyramid pooling module are fed into the bottleneck part. To realize single-mode independent prediction, the bottleneck part is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which respectively extract three groups of multi-scale features for the fusion mode, the thermal infrared mode and the visible light mode. The fusion mode bottleneck takes F_3^rgb, F_3^ir, F_4^rgb, F_4^ir and F_spp as input, channel-concatenates F_3^rgb with F_3^ir and F_4^rgb with F_4^ir as multi-source features, and outputs the fused multi-scale features P_3^f, P_4^f and P_5^f; the thermal infrared mode bottleneck takes F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck takes F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v. To reduce the number of parameters and improve efficiency, the bottleneck structural parameters of the visible light and thermal infrared modes are shared.
S4: The detection module performs target detection on the three groups of multi-scale, multi-mode features (fusion mode features P_3^f, P_4^f, P_5^f, visible light mode features P_3^v, P_4^v, P_5^v, and thermal infrared mode features P_3^t, P_4^t, P_5^t), yielding multi-scale prediction results for the three modes. For the predictions of the fusion, thermal infrared and visible light modes, decision-level fusion is used to fuse the predictions at the same scale. Finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input image.
In this embodiment, the network executing steps S1-S4 is referred to as MSAS-YOLOv5. The execution of steps S1-S4 is described in further detail below in conjunction with the structure of MSAS-YOLOv5.
In this embodiment, in step S1, the visible light image I_rgb of size H_input × W_input × 3 and the thermal infrared image I_ir of size H_input × W_input × 3 are sent to the backbone network part for feature extraction. Each convolution module reduces the feature map size by 1/2, so after convolution module i a feature map of size H_input/2^i × W_input/2^i is obtained. The output feature sizes and channel numbers of C_3, C_4 and the spatial pyramid pooling module are H_input/8 × W_input/8 × 256, H_input/16 × W_input/16 × 512 and H_input/32 × W_input/32 × 1024, respectively. The features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir and the output feature F_spp of the spatial pyramid pooling module are fed into the bottleneck part and the segmentation part.
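For illustration, the following minimal PyTorch sketch shows how the stage-5 visible light and thermal infrared features could be channel-concatenated and passed through an SPPF-style spatial pyramid pooling block to obtain the multi-source feature F_spp. The class names, the SPPF variant and the channel widths of the fused block are assumptions of this sketch and do not limit the embodiment.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Basic convolution module: convolution -> batch normalization -> SiLU."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """SPPF-style spatial pyramid pooling (assumed variant, as in YOLOv5)."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_mid, 1)
        self.cv2 = ConvBNSiLU(c_mid * 4, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

# Stage-5 features of the two parameter-independent backbones (stride 32, 1024 channels each).
f5_rgb = torch.randn(1, 1024, 16, 20)   # 512x640 input -> 16x20 at stride 32
f5_ir = torch.randn(1, 1024, 16, 20)
spp = SPPF(c_in=2048, c_out=1024)
f_spp = spp(torch.cat([f5_rgb, f5_ir], dim=1))   # multi-source feature F_spp, 1x1024x16x20
```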
In this embodiment, the visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir are each formed by a basic convolution module and a cross-stage local convolution module connected in series. The cross-stage local convolution module splits the input feature map along the channel dimension into two feature maps F_p1 and F_p2; F_p1 passes through only one basic convolution module with a residual connection, F_p2 passes through n basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module. For convolution modules 1 to 5, n takes the values 1, 3, 6, 9 and 3, respectively.
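A hedged sketch of the cross-stage local convolution module described above follows, using the literal channel split into F_p1 and F_p2; the exact placement of the residual connection and the channel widths are assumptions of this sketch (YOLOv5's C3 block, for comparison, realizes the split with 1×1 convolutions rather than a literal chunk).

```python
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k=1, s=1):
    """Basic convolution module: convolution -> batch normalization -> SiLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class CrossStageLocalConv(nn.Module):
    """Channel-split the input into F_p1 / F_p2, process F_p1 with one basic
    convolution, F_p2 with n basic convolutions plus a residual connection
    (placement assumed), then concatenate and apply a final basic convolution."""
    def __init__(self, channels, n):
        super().__init__()
        c_half = channels // 2
        self.branch1 = conv_bn_silu(c_half, c_half, 1)
        self.branch2 = nn.Sequential(*[conv_bn_silu(c_half, c_half, 3) for _ in range(n)])
        self.fuse = conv_bn_silu(channels, channels, 1)

    def forward(self, x):
        f_p1, f_p2 = torch.chunk(x, 2, dim=1)        # split along the channel dimension
        y1 = self.branch1(f_p1)
        y2 = f_p2 + self.branch2(f_p2)               # residual connection
        return self.fuse(torch.cat([y1, y2], dim=1))

# n = 3 corresponds, e.g., to backbone stage 2 in this embodiment (n = 1, 3, 6, 9, 3 for stages 1-5).
block = CrossStageLocalConv(channels=256, n=3)
out = block(torch.randn(1, 256, 64, 80))
```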
In this embodiment, in step S1, the parameters of the visible light convolution module and the thermal infrared convolution module are independent.
In this embodiment, the segmentation convolution module in step S2 refers to a structure formed by connecting in series a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer. The first 3×3 convolution layer has 256 output channels. The final 3×3 convolution layer has 1 output channel, indicating the predicted probability that each position contains an object. A semantic segmentation prediction of size H_input/8 × W_input/8 is thus obtained.
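A minimal sketch of this segmentation convolution module follows; the input channel count (256 + 256 concatenated stride-8 features) and the use of logits rather than probabilities at the output are assumptions of this sketch.

```python
import torch
import torch.nn as nn

# Segmentation convolution module: 3x3 conv (256 ch) -> batch norm -> ReLU -> 3x3 conv (1 ch).
seg_head = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=3, padding=1),   # input: F_3^rgb and F_3^ir concatenated (assumed 256 + 256)
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, kernel_size=3, padding=1),     # one channel: per-position object probability (logit)
)

f3_rgb = torch.randn(1, 256, 64, 80)   # stride-8 visible light feature for a 512x640 input
f3_ir = torch.randn(1, 256, 64, 80)    # stride-8 thermal infrared feature
p_seg = seg_head(torch.cat([f3_rgb, f3_ir], dim=1))  # H/8 x W/8 segmentation prediction
```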
In this embodiment, the semantic segmentation prediction from step S2 is used for auxiliary supervision at training time. In the training phase of the network model, the target detection bounding box labels are used as a mask: positions containing objects in the original image are set to 1, and the mask is scaled to 1/8 of the original image size to form the semantic segmentation training label G_seg, which supervises the semantic segmentation prediction result P_seg obtained in step S2. The semantic segmentation auxiliary supervision is formed by the loss function L_seg below, where λ_seg is the weight of the semantic segmentation prediction loss and is set to 0.5, and L_bce is the binary cross-entropy loss:
L_seg = λ_seg · L_bce(P_seg, G_seg)
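The auxiliary supervision above can be sketched as follows; building the mask by filling ground-truth boxes, downscaling with nearest-neighbour interpolation, and applying the loss on logits are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def seg_aux_loss(p_seg, boxes, img_h, img_w, lambda_seg=0.5):
    """L_seg = lambda_seg * L_bce(P_seg, G_seg): positions inside ground-truth
    bounding boxes are set to 1 and the mask is downscaled to 1/8 resolution
    to form G_seg. boxes: iterable of (x1, y1, x2, y2) in input-image pixels."""
    mask = torch.zeros(1, 1, img_h, img_w)
    for x1, y1, x2, y2 in boxes:
        mask[..., int(y1):int(y2), int(x1):int(x2)] = 1.0
    g_seg = F.interpolate(mask, size=p_seg.shape[-2:], mode="nearest")
    return lambda_seg * F.binary_cross_entropy_with_logits(p_seg, g_seg)

# Example with the sketched segmentation output (H/8 x W/8 logits) for a 512x640 image:
p_seg = torch.randn(1, 1, 64, 80)
loss = seg_aux_loss(p_seg, boxes=[(100, 120, 180, 300)], img_h=512, img_w=640)
```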
In this embodiment, as shown in FIG. 1, the bottleneck part in step S3 is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck. Each of the three bottlenecks processes the multi-scale feature maps using a path aggregation network (PAN) structure that includes top-down and bottom-up paths.
In this example, as shown in FIG. 1, the fusion mode bottleneck structure takes the channel-concatenated backbone features and F_spp as input. The top-down path comprises PAN modules 1 to 3, and the bottom-up path comprises PAN modules 4 and 5. PAN module 1 processes F_spp with a basic convolution module; the processed feature is sent into PAN module 5, and is also upsampled to twice its size and sent into PAN module 2.
In this embodiment, PAN modules 2 and 3 consist of channel concatenation and a basic convolution module. As shown in FIG. 1, PAN modules 2 and 3 aggregate the PAN module features from the previous level with the backbone features of the corresponding scale. PAN modules 4 and 5 consist of a basic convolution module, channel concatenation and a cross-stage local convolution module with n = 3. The basic convolution modules of PAN modules 4 and 5 use a stride of 2 to downscale the feature from the previous PAN module, which is then concatenated with the PAN module feature of the same scale and finally passed through the cross-stage local convolution module to obtain the output feature. PAN modules 3, 4 and 5 output the multi-scale features P_3^f, P_4^f and P_5^f, whose sizes and channel numbers are H_input/8 × W_input/8 × 256, H_input/16 × W_input/16 × 512 and H_input/32 × W_input/32 × 1024, respectively.
In this embodiment, as shown in FIG. 1, the thermal infrared mode bottleneck structure takes the thermal infrared backbone features F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck structure takes the visible light backbone features F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v. The processing of the multi-scale features in the visible light and thermal infrared modes is similar to that of the fusion mode: the PAN module structure is the same as in the fusion mode for PAN modules 2, 3, 4 and 5, but with parameters independent of the corresponding fusion mode structures.
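The PAN bottleneck described above can be sketched as follows for a single modality; the number of convolutions per PAN module and the exact channel arithmetic are assumptions of this sketch (for the fusion mode bottleneck, the channel-concatenated RGB/IR inputs would double the input widths of PAN modules 1-3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_silu(c_in, c_out, k=1, s=1):
    """Basic convolution module: convolution -> batch normalization -> SiLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class PANNeck(nn.Module):
    """One modality bottleneck: top-down path (PAN modules 1-3) and bottom-up
    path (PAN modules 4-5). Output channels follow the embodiment
    (256/512/1024 at strides 8/16/32); each PAN module is simplified to a
    single basic convolution for this sketch."""
    def __init__(self):
        super().__init__()
        self.pan1 = conv_bn_silu(1024, 512, 1)        # PAN module 1
        self.pan2 = conv_bn_silu(512 + 512, 256, 1)   # PAN module 2: concat + conv
        self.pan3 = conv_bn_silu(256 + 256, 256, 1)   # PAN module 3
        self.down4 = conv_bn_silu(256, 256, 3, s=2)   # PAN module 4: stride-2 conv ...
        self.pan4 = conv_bn_silu(256 + 256, 512, 1)   # ... then concat + conv
        self.down5 = conv_bn_silu(512, 512, 3, s=2)   # PAN module 5
        self.pan5 = conv_bn_silu(512 + 512, 1024, 1)

    def forward(self, f3, f4, f_spp):
        # top-down path
        t5 = self.pan1(f_spp)
        t4 = self.pan2(torch.cat([F.interpolate(t5, scale_factor=2), f4], dim=1))
        p3 = self.pan3(torch.cat([F.interpolate(t4, scale_factor=2), f3], dim=1))
        # bottom-up path
        p4 = self.pan4(torch.cat([self.down4(p3), t4], dim=1))
        p5 = self.pan5(torch.cat([self.down5(p4), t5], dim=1))
        return p3, p4, p5   # strides 8, 16, 32

# Single-modality inputs (e.g. thermal infrared) for a 512x640 image:
f3 = torch.randn(1, 256, 64, 80)
f4 = torch.randn(1, 512, 32, 40)
f_spp = torch.randn(1, 1024, 16, 20)
p3, p4, p5 = PANNeck()(f3, f4, f_spp)
```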
In this embodiment, the basic convolution modules in the backbone network part and the bottleneck part are each composed of a convolution layer (1×1 or 3×3), a batch normalization layer and a Sigmoid-weighted linear unit (SiLU).
In this embodiment, the detection module in step S4 is composed of a 1×1 convolution layer. Target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent. To reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
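A minimal sketch of the detection module follows; the anchor count and class count are assumptions, while the per-scale 1×1 convolutions with parameters shared across modalities follow the description above.

```python
import torch
import torch.nn as nn

n_anchors, n_classes = 3, 1                    # assumed values (e.g. a pedestrian detector)
out_ch = n_anchors * (5 + n_classes)           # box (4) + confidence (1) + classes per anchor
detect = nn.ModuleList([
    nn.Conv2d(256, out_ch, kernel_size=1),     # stride-8 head
    nn.Conv2d(512, out_ch, kernel_size=1),     # stride-16 head
    nn.Conv2d(1024, out_ch, kernel_size=1),    # stride-32 head
])

def predict(multi_scale_feats):
    """Apply the per-scale heads to one modality's P_3/P_4/P_5 features;
    the same heads are reused for the fusion, visible light and IR branches."""
    return [head(f) for head, f in zip(detect, multi_scale_feats)]

feats_fuse = [torch.randn(1, 256, 64, 80), torch.randn(1, 512, 32, 40), torch.randn(1, 1024, 16, 20)]
preds_fuse = predict(feats_fuse)
```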
In this embodiment, the decision-level fusion of the three modes' detection results in step S4 refers to taking a weighted average of the same-scale detection results P_F, P_R and P_I of the fusion, visible light and thermal infrared modes to obtain the final detection result P, as shown in the following formula, where the weight coefficients λ_F, λ_R and λ_I of the three modes are 0.5, 0.25 and 0.25, respectively:

P = λ_F · P_F + λ_R · P_R + λ_I · P_I
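The decision-level fusion can be sketched as a per-scale weighted sum; whether the averaging is applied to raw logits or to decoded predictions is an assumption of this sketch.

```python
import torch

def decision_level_fusion(p_fuse, p_rgb, p_ir, w_f=0.5, w_r=0.25, w_i=0.25):
    """Weighted average of same-scale detection outputs from the fusion, visible
    light and thermal infrared branches (weights 0.5 / 0.25 / 0.25 as stated above)."""
    return [w_f * f + w_r * r + w_i * i for f, r, i in zip(p_fuse, p_rgb, p_ir)]

# Toy example with one scale of shape (batch, anchors*(5+classes), H, W):
p_f = [torch.randn(1, 18, 64, 80)]
p_r = [torch.randn(1, 18, 64, 80)]
p_i = [torch.randn(1, 18, 64, 80)]
fused = decision_level_fusion(p_f, p_r, p_i)
```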
In this embodiment, the detection results of the three modes in step S4 are each supervised during training by matching the loss function with the corresponding training labels. The labels shared by the visible light and thermal infrared modes are taken as ground truth to supervise the fusion mode detection result. To realize single-mode detection auxiliary supervision, labels unique to the visible light mode supervise the visible light mode detection result, and labels unique to the thermal infrared mode supervise the thermal infrared detection result. The total detection loss of this step is L_det, where L_cls, L_bbox and L_obj are the classification loss, bounding box loss and confidence loss of the fusion mode prediction; p_f is the fusion mode prediction result, obtained by merging the multi-scale fusion mode predictions, and gt_f is the modality-shared training label. L_obj_v and L_bbox_v are the confidence loss and bounding box loss of the visible light mode prediction task, and L_obj_t and L_bbox_t are the confidence loss and bounding box loss of the thermal infrared mode prediction task. p_v is the visible light single-mode prediction result, obtained by merging the multi-scale visible light mode predictions; p_t is the thermal infrared single-mode prediction result, obtained by merging the multi-scale thermal infrared mode predictions. gt_v and gt_t are the visible light labels and the thermal infrared labels, respectively. Independent labeling of the visible light and thermal infrared modes provides mode-specific supervisory information for the two modes.
L_cls, L_obj, L_obj_v and L_obj_t are cross-entropy losses, and L_bbox, L_bbox_v and L_bbox_t are CIoU losses. λ_cls, λ_obj and λ_bbox are the weights of the classification loss, confidence loss and bounding box loss, set to 0.5, 0.5 and 0.05, respectively:
L_det = λ_cls·L_cls(p_f, gt_f) + λ_obj·L_obj(p_f, gt_f) + λ_bbox·L_bbox(p_f, gt_f)
      + λ_obj·L_obj_v(p_v, gt_v) + λ_bbox·L_bbox_v(p_v, gt_v)
      + λ_obj·L_obj_t(p_t, gt_t) + λ_bbox·L_bbox_t(p_t, gt_t)
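The composition of L_det above can be sketched as follows, assuming the per-term losses (classification, confidence, bounding box) have already been computed for each branch by a YOLOv5-style matching procedure.

```python
import torch

def detection_loss(losses_fuse, losses_rgb, losses_ir,
                   lambda_cls=0.5, lambda_obj=0.5, lambda_bbox=0.05):
    """Sketch of L_det: the fusion branch gets classification + confidence +
    bounding-box terms from the shared labels, while the visible light and
    thermal infrared branches get confidence and bounding-box terms from their
    modality-specific labels. Each argument is a dict of scalar loss terms."""
    l = (lambda_cls * losses_fuse["cls"] + lambda_obj * losses_fuse["obj"]
         + lambda_bbox * losses_fuse["bbox"])
    l = l + lambda_obj * losses_rgb["obj"] + lambda_bbox * losses_rgb["bbox"]
    l = l + lambda_obj * losses_ir["obj"] + lambda_bbox * losses_ir["bbox"]
    return l

# Example with placeholder scalar losses:
dummy_f = {"cls": torch.tensor(0.8), "obj": torch.tensor(1.2), "bbox": torch.tensor(0.6)}
dummy_v = {"obj": torch.tensor(1.0), "bbox": torch.tensor(0.5)}
dummy_t = {"obj": torch.tensor(0.9), "bbox": torch.tensor(0.4)}
l_det = detection_loss(dummy_f, dummy_v, dummy_t)
```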
In this embodiment, the training is performed in a deep supervision manner after step S4, and the total loss function is as follows:
L_total = L_seg + L_det
where L_seg and L_det are the semantic segmentation loss of step S2 and the target detection loss of step S4, respectively.
To verify the effectiveness of MSAS-YOLOv5, this embodiment uses the public KAIST dataset for training and testing of the network framework and compares it with other methods. The KAIST dataset contains 7,595 visible-infrared image pairs for training and 2,252 pairs for testing. The image size is 512 × 640 pixels. The data can further be divided into all-day, daytime and nighttime sub-datasets, on which the performance of the multi-source target detection algorithm is tested respectively.
The algorithm proposed in this example is compared with five recent multi-source target detection methods: CIAN (Cross-modality Interactive Attention Network), MSDS-RCNN (Multispectral Simultaneous Detection and Segmentation R-CNN), AR-CNN (Aligned Region CNN), MBNet (Modality Balance Network) and UGCML (Uncertainty-Guided Cross-Modal Learning); the specific results are shown in Table 1. Two evaluation indexes are used. Accuracy is evaluated with the log-average miss rate (MR^-2); the smaller this value, the fewer targets the detector misses and the higher its accuracy. Speed is measured in frames per second (FPS). As Table 1 shows, the method of this embodiment (MSAS-YOLOv5) reaches the lowest log-average miss rate of 5.89% on the all-day sub-dataset, indicating the highest detection accuracy, and reaches 24.39 FPS, exceeding the other multi-source target detection methods in the table. Compared with the second best method (MBNet), MSAS-YOLOv5 reduces the log-average miss rate by 2.27% on the all-day sub-dataset, 1.01% on the daytime sub-dataset, and 4.24% on the nighttime sub-dataset. FIG. 2 shows three groups of multi-source target detection results of the method of this embodiment compared with mainstream existing methods. As seen in the first row of FIG. 2, the detection result of this embodiment is more complete and closer to the ground truth. The input images in the second row contain many small pedestrian targets; the method of this embodiment detects the small pedestrians accurately, while the other comparison methods miss some of them. For the night scene in the third row, the detection result of this embodiment is still closer to the real result than the other methods.
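For clarity on the accuracy metric used here, the following is a hedged sketch of the log-average miss rate (MR^-2) computation: the miss rate is sampled at nine FPPI reference points evenly spaced in log space over [10^-2, 10^0] and the geometric mean of the samples is taken; the interpolation scheme shown is an assumption of this sketch.

```python
import numpy as np

def log_average_miss_rate(miss_rates, fppi):
    """MR^-2 sketch: sample the MR-FPPI curve at 9 log-spaced FPPI reference
    points in [1e-2, 1e0] and return the geometric mean of the samples.
    miss_rates and fppi must be sorted by increasing FPPI."""
    ref_points = np.logspace(-2.0, 0.0, num=9)
    samples = []
    for ref in ref_points:
        # take the miss rate at the largest FPPI not exceeding the reference point
        idx = np.where(fppi <= ref)[0]
        samples.append(miss_rates[idx[-1]] if len(idx) else 1.0)
    samples = np.maximum(np.array(samples), 1e-10)   # avoid log(0)
    return np.exp(np.mean(np.log(samples)))

# Example with a toy MR-FPPI curve (miss rate decreases as FPPI grows):
fppi = np.array([0.005, 0.01, 0.05, 0.1, 0.5, 1.0])
mr = np.array([0.60, 0.40, 0.20, 0.12, 0.08, 0.06])
print(log_average_miss_rate(mr, fppi))
```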
Table 1 is a comparative table of test results for the methods of the examples of the present invention and other prior art methods
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.
Claims (8)
1. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 is characterized by comprising four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part; the backbone network part extracts convolution feature maps from the visible light and thermal infrared images respectively; the bottleneck part combines the feature maps of the visible light and thermal infrared modes along top-down and bottom-up paths to obtain features at each level for the fusion mode, the visible light mode and the thermal infrared mode; the segmentation part predicts semantic segmentation from the visible light and thermal infrared mode feature maps through convolution layers and uses it as an auxiliary supervision task; the detection part performs target detection on the features of each level of each mode and applies decision-level fusion to the prediction results of the fusion, visible light and thermal infrared modes; the predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode prediction auxiliary information;
The method specifically comprises the following steps:
Step 1: the visible light and thermal infrared images I_rgb and I_ir are input into the backbone network part for feature extraction; I_rgb and I_ir sequentially pass through visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir, where i ∈ {1,2,3,4,5}, yielding feature pairs F_i^rgb and F_i^ir, where F_i^rgb is the feature extracted from the visible light image I_rgb by visible light convolution module C_i^rgb and F_i^ir is the feature extracted from the thermal infrared image I_ir by thermal infrared convolution module C_i^ir; after C_5^rgb and C_5^ir, the features F_5^rgb and F_5^ir are channel-concatenated and fed into a spatial pyramid pooling module to obtain the processed multi-source feature F_spp;
Step 2: the feature maps F_3^rgb and F_3^ir output by the visible light convolution module C_3^rgb and the thermal infrared convolution module C_3^ir in step 1 are channel-concatenated, and a semantic segmentation prediction result is then obtained through a segmentation convolution module;
Step 3: the backbone features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir from step 1 and the feature F_spp from the spatial pyramid pooling module are fed into the bottleneck part; to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which respectively extract three groups of multi-scale features for the fusion mode, the thermal infrared mode and the visible light mode; the fusion mode bottleneck takes F_3^rgb, F_3^ir, F_4^rgb, F_4^ir and F_spp as input, channel-concatenates F_3^rgb with F_3^ir and F_4^rgb with F_4^ir as multi-source features, and outputs the fused multi-scale features P_3^f, P_4^f and P_5^f; the thermal infrared mode bottleneck takes F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck takes F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v;
Step 4: a detection module performs target detection on the three groups of multi-scale features obtained from the three mode bottlenecks in step 3 to obtain object predictions for each mode at each scale; for the predictions of the fusion, thermal infrared and visible light modes, decision-level fusion is used to fuse the predictions at the same scale; finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input multi-source image.
2. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir in step 1 are each formed by a basic convolution module and a cross-stage local convolution module connected in series; the cross-stage local convolution module splits the input feature map along the channel dimension into two feature maps F_p1 and F_p2; F_p1 passes through only one basic convolution module with a residual connection, F_p2 passes through several basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module.
3. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the parameters of the visible light convolution modules C_i^rgb and the thermal infrared convolution modules C_i^ir are independent.
4. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the segmentation convolution module in step 2 refers to a structure formed by connecting in series a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer.
5. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the three mode bottleneck structures in step 3 each process the multi-scale feature maps using a path aggregation network (PAN) structure that includes top-down and bottom-up paths.
6. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the basic convolution modules in the backbone network part and the bottleneck part are composed of a convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit.
7. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the detection module in step 4 is composed of a 1×1 convolution layer; target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent; to reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
8. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: in step 4, the decision-level fusion of the three modes' detection results refers to a weighted average of the detection results of the fusion, visible light and thermal infrared modes, with weight coefficients of 0.5, 0.25 and 0.25, respectively.
Priority Application (1)
- CN202310211247.7A, filed 2023-03-07 (priority date 2023-03-07): RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5

Publications (2)
- CN116665036A, published 2023-08-29
- CN116665036B, granted 2024-09-17

Family
- Family ID: 87724909
- 2023-03-07: application CN202310211247.7A filed in CN; granted as patent CN116665036B (status: Active)

Families Citing this family (1)
- CN117975040B, granted 2024-06-18: GIS infrared image recognition system and method based on improved YOLOv5
Citations (2)
- CN111767882A, published 2020-10-13: Multi-mode pedestrian detection method based on an improved YOLO model
- CN113361466A, published 2021-09-07: Multi-spectral target detection method based on multi-modal cross-guided learning

Family Cites Families (5)
- CN111209810B, granted 2023-05-26: Bounding-box-segmentation-supervised deep neural network architecture for accurate real-time pedestrian detection in visible light and infrared images
- CN112529878B, granted 2024-04-02: Multi-view semi-supervised lymph node classification method, system and equipment
- CN113627504B, granted 2022-06-14: Multi-mode multi-scale feature fusion target detection method based on generative adversarial networks
- CN115331162A, published 2022-11-11: Cross-scale infrared pedestrian detection method, system, medium, equipment and terminal
- CN115713679A, published 2023-02-24: Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth maps
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant