CN116665036B - RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5
- Publication number: CN116665036B
- Application number: CN202310211247.7A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/10 — Scenes; scene-specific elements; terrestrial scenes
- G06T7/10 — Image analysis; segmentation; edge detection
- G06V10/40 — Extraction of image or video features
- G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06T2207/10048 — Image acquisition modality; infrared image
- Y02T10/40 — Engine management systems
Abstract
The invention relates to an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, and designs a single-mode auxiliary supervision RGB-infrared multi-source YOLOv5 target detection model, MSAS-YOLOv5. The model uses YOLOv5, which balances speed and accuracy, as its base method and efficiently extracts multi-level features of the RGB visible light and thermal infrared images. A single-mode auxiliary supervision method is designed, combining a semantic segmentation auxiliary task with single-mode detection auxiliary tasks. Independent target detection prediction branches are set up for the visible light and thermal infrared modes, and independent visible light and thermal infrared mode labels are used to supervise the predictions of the two modes respectively. In terms of accuracy, MSAS-YOLOv5 achieves a log-average miss rate of 5.89% on the KAIST dataset, missing fewer detection targets. Its FPS reaches 24.39, faster than mainstream multi-source target detection methods.
Description
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5.
Background
Object detection is an important area of computer vision. Image-based object detection identifies the categories and positions of objects present in an image through image processing and computer vision techniques. Collaborative target detection with multi-source RGB-infrared images alleviates the poor recognition performance of RGB-only target detection under bad weather and poor illumination. RGB-infrared multi-source image target detection has important applications in fields such as autonomous driving, video surveillance and security.
In recent years, deep learning techniques, typified by convolutional neural networks, have become the mainstream approach in target detection. Deep-learning-based object detection methods can be broadly divided into single-stage detection, represented by the YOLO series, and two-stage detection, represented by the Faster R-CNN series. Deep-learning-based RGB-infrared multi-source target detection has also been widely studied and outperforms multi-source target detection based on traditional methods. For example, Zhang et al. ("Weakly aligned cross-modal learning for multispectral pedestrian detection", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 5127-5137) designed a multi-source target detection framework based on Faster R-CNN and provided independent visible light and thermal infrared annotations for the multi-source target detection dataset, because of misalignment in the dataset: objects in the visible light and thermal infrared images have different positions, and this misalignment affects the supervision of a multi-source object detection network. Kim et al. ("Uncertainty-guided cross-modal learning for robust multispectral pedestrian detection", IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(3): 1510-1523) proposed a Faster R-CNN-based uncertainty-aware multi-source pedestrian detection framework that learns more discriminative multi-source modality features.
However, current deep-learning RGB-infrared multi-source target detection methods still have problems: 1. most current multi-source target detection methods are based on the two-stage Faster R-CNN detector, so their inference speed is low; 2. existing work rarely makes explicit use of independent visible light and thermal infrared annotations to provide additional auxiliary information for object detection, even though such mode-independent labels can provide more accurate feature-learning supervision for the visible light and thermal infrared backbone networks. It is therefore necessary to design a single-mode auxiliary supervision multi-source YOLOv5 target detection method that builds on a fast single-stage detector and is supervised with combined auxiliary information.
Disclosure of Invention
Technical problem to be solved
Aiming at the problems that existing multi-source target detection methods are slow and do not fully exploit auxiliary information to improve detection supervision, the invention provides an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5.
Technical proposal
An RGB-infrared multi-source image target detection model based on single-mode auxiliary supervision and YOLOv5 is characterized by comprising four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part; the backbone network part extracts convolution feature maps from the visible light and thermal infrared images respectively; the bottleneck part combines the feature maps of the visible light and thermal infrared modes along top-down and bottom-up paths to obtain features at each level for the fusion mode, the visible light mode and the thermal infrared mode; the segmentation part predicts semantic segmentation from the visible light and thermal infrared mode feature maps through convolution layers and uses it as an auxiliary supervision task; the detection part performs target detection on the features of each level of each mode and applies decision-level fusion to the prediction results of the fusion, visible light and thermal infrared modes; the predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode prediction auxiliary information.
An RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 is characterized by comprising the following steps:
Step 1: the visible light and thermal infrared images I_rgb and I_ir are input into the backbone network part for feature extraction; I_rgb and I_ir sequentially pass through visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir, where i ∈ {1,2,3,4,5}, yielding feature pairs F_i^rgb and F_i^ir, where F_i^rgb is the feature extracted from the visible light image I_rgb by visible light convolution module C_i^rgb and F_i^ir is the feature extracted from the thermal infrared image I_ir by thermal infrared convolution module C_i^ir; after C_5^rgb and C_5^ir, the features F_5^rgb and F_5^ir are channel-concatenated and fed into a spatial pyramid pooling module to obtain the processed multi-source feature F_spp;
Step 2: the feature maps F_3^rgb and F_3^ir output by the visible light convolution module C_3^rgb and the thermal infrared convolution module C_3^ir in step 1 are channel-concatenated, and a semantic segmentation prediction result is then obtained through a segmentation convolution module;
Step 3: the backbone features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir from step 1 and the feature F_spp from the spatial pyramid pooling module are fed into the bottleneck part; to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which respectively extract three groups of multi-scale features for the fusion mode, the thermal infrared mode and the visible light mode; the fusion mode bottleneck takes F_3^rgb, F_3^ir, F_4^rgb, F_4^ir and F_spp as input, channel-concatenates F_3^rgb with F_3^ir and F_4^rgb with F_4^ir as multi-source features, and outputs the fused multi-scale features P_3^f, P_4^f and P_5^f; the thermal infrared mode bottleneck takes F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck takes F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v;
Step 4: a detection module performs target detection on the three groups of multi-scale features obtained from the three mode bottlenecks in step 3 to obtain object predictions for each mode at each scale; for the predictions of the fusion, thermal infrared and visible light modes, decision-level fusion is used to fuse the predictions at the same scale; finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input multi-source image.
The invention further adopts the technical scheme that: the visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir in step 1 are each formed by a basic convolution module and a cross-stage local convolution module connected in series; the cross-stage local convolution module splits the input feature map along the channel dimension into two feature maps F_p1 and F_p2; F_p1 passes through only one basic convolution module with a residual connection, F_p2 passes through several basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module.
The invention further adopts the technical scheme that: the parameters of the visible light convolution modules C_i^rgb and the thermal infrared convolution modules C_i^ir are independent.
The invention further adopts the technical scheme that: the segmentation convolution module in step 2 refers to a structure formed by connecting in series a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer.
The invention further adopts the technical scheme that: the three mode bottleneck structures in step 3 each process the multi-scale feature maps using a path aggregation network (PAN) structure that includes top-down and bottom-up paths.
The invention further adopts the technical scheme that: the basic convolution modules in the backbone network part and the bottleneck part are composed of a convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit.
The invention further adopts the technical scheme that: the detection module in step 4 is composed of a 1×1 convolution layer; target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent; to reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
The invention further adopts the technical scheme that: in the step 4, the decision-level fusion of the detection results of the three modes refers to weighted average of the detection results of the fusion, visible light and thermal infrared modes, and the weight coefficients of the three modes are respectively 0.5, 0.25 and 0.25.
Advantageous effects
The invention provides an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, and designs a single-mode auxiliary supervision RGB-infrared multi-source YOLOv5 target detection model, MSAS-YOLOv5. The model uses YOLOv5, which balances speed and accuracy, as its base method and efficiently extracts multi-level features of the RGB visible light and thermal infrared images. A single-mode auxiliary supervision method is designed, combining a semantic segmentation auxiliary task with single-mode detection auxiliary tasks. Independent target detection prediction branches are set up for the visible light and thermal infrared modes, and independent visible light and thermal infrared mode labels are used to supervise the predictions of the two modes respectively. In terms of accuracy, MSAS-YOLOv5 achieves a log-average miss rate of 5.89% on the KAIST dataset, missing fewer detection targets. Its FPS reaches 24.39, faster than mainstream multi-source target detection methods.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a network configuration diagram of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing comparison of target detection results of the method according to the embodiment of the present invention and other prior art methods.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
An RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 builds a new single-mode auxiliary supervision multi-source YOLOv5 target detection model for RGB-infrared multi-source image target detection, using YOLOv5, a single-stage target detection method that balances speed and accuracy, and introducing single-mode auxiliary supervision. The detection model comprises four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part. The backbone network part extracts convolution feature maps from the visible light and thermal infrared images respectively. The bottleneck part combines the feature maps of the visible light and thermal infrared modes along top-down and bottom-up paths to obtain features at each level for the fusion mode, the visible light mode and the thermal infrared mode. The segmentation part predicts semantic segmentation from the visible light and thermal infrared mode feature maps through convolution layers and uses it as an auxiliary supervision task. The detection part performs target detection on the features of each level of each mode and applies decision-level fusion to the prediction results of the fusion, visible light and thermal infrared modes. The predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode prediction auxiliary information.
The method specifically comprises the following steps:
Step 1: the visible and thermal infrared images I rgb and I ir are input into the backbone network part for feature extraction. The visible light and thermal infrared images I rgb and I ir sequentially pass through a visible light convolution module And thermal infrared convolution module(Where i.epsilon. {1,2,3,4,5 }) to obtain a plurality of feature pairsAndWherein the method comprises the steps ofVisible light convolution module for visible light image I rgb The features that are extracted later are used to determine,Thermal infrared convolution module for thermal infrared image I ir Post-extracted features. At the position ofAndAfter that, the characteristics areAndChannel splicing is carried out, and the channel splicing is sent into a space pyramid pooling module to obtain the processed multi-source characteristics
Step 2: convoluting the visible light in the step 1And thermal infrared convolution moduleOutput characteristic diagramAndPerforming channel splicing, and then obtaining a semantic segmentation prediction result through a segmentation convolution module;
Step 3: characterization of backbone network part in step 1 Features from a spatial pyramid pooling moduleInto the bottleneck section. In order to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures of a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, and three-to-multi-scale characteristics of the fusion mode, the thermal infrared mode and the visible light mode are respectively extracted. Wherein the modality bottleneck structure is fused AndFor input, willAndChannel stitching is carried out as multi-source feature processing, and multi-scale features are outputThermal infrared mode bottleneck structureAndOutputting thermal infrared multi-scale features as inputsVisible light mode bottleneck structureAndFor inputting and outputting visible light multi-scale featuresIn order to reduce the parameter quantity and improve the efficiency, the bottleneck part structural parameters of the visible light and the thermal infrared modes are shared;
step 4: and (3) performing target detection on the three-to-multi-scale features obtained in the bottleneck part of the three modes in the step (3) by using a detection module to obtain object prediction under each mode and each scale. For prediction of three modes of fusion, thermal infrared and visible light, decision-level fusion is used for fusion of predictions under the same scale. And finally, carrying out non-maximum suppression post-processing on the three-scale detection results obtained through fusion to obtain target detection prediction output of the input multi-source image.
Optionally, the visible light convolution module in step 1And thermal infrared convolution moduleAre each formed by a basic convolution module and a cross-stage local convolution module which are connected in series. The cross-stage local convolution module splits an input feature map into two feature maps F p1 and F p2,Fp1 according to channels, residual connection is carried out only through one basic convolution module, F p2 is carried out through a plurality of basic convolution modules, and finally the two feature maps are subjected to channel splicing and pass through the final basic convolution module, so that output of the cross-stage local convolution module is obtained.
Optionally, the visible light convolution module in step 1And thermal infrared convolution moduleThe parameters are independent.
Optionally, the split convolution module in step 2 refers to a structure formed by concatenating 3×3 convolution layers, batch normalization, and correction linear units, and 3×3 convolution layers.
Optionally, the semantic segmentation predictions predicted in step 2 will be used for assisted supervision at training time. And in the training stage of the network model, using the target detection boundary box label as a mask, and supervising the semantic segmentation prediction result obtained in the step 2.
Optionally, the three modality bottleneck section in step 3 processes the multi-scale feature map using a path aggregation network PAN (Path Aggregation Network) structure that contains top-down and bottom-up paths.
Optionally, the basic convolution modules in the backbone network part and the bottleneck part are composed of a convolution layer, a batch normalization layer and a Sigmoid weighted linear unit.
Optionally, the detection module in step 4 is composed of a 1×1 convolution layer. Target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent. To reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
Optionally, the decision-level fusion of the detection results of the three modes in the step 4 refers to weighted averaging of the detection results of the fusion, visible light and thermal infrared modes, and the weight coefficients of the three modes are respectively 0.5, 0.25 and 0.25.
Optionally, the detection results of the three modes in step 4 are each supervised during training by matching the loss function with the corresponding training labels: the labels shared by the visible light and thermal infrared modes are taken as ground truth to supervise the fusion mode detection result; to realize single-mode detection auxiliary supervision, labels unique to the visible light mode supervise the visible light mode detection result, and labels unique to the thermal infrared mode supervise the thermal infrared mode detection result.
Specific examples:
Addressing the problems that existing multi-source target detection methods are insufficient in speed and make little comprehensive use of effective supervision information, the invention designs an RGB-infrared multi-source image target detection model based on single-mode auxiliary supervision and YOLOv5, as shown in FIG. 1. It includes a backbone network part, a bottleneck part, a segmentation part and a detection part. The specific method comprises the following steps:
S1: The visible light and thermal infrared images I_rgb and I_ir are input into the backbone network part for feature extraction to obtain feature pairs F_i^rgb and F_i^ir, where F_i^rgb is the feature extracted from the visible light image I_rgb by visible light convolution module C_i^rgb and F_i^ir is the feature extracted from the thermal infrared image I_ir by thermal infrared convolution module C_i^ir. After C_5^rgb and C_5^ir, the features F_5^rgb and F_5^ir are channel-concatenated and fed into the spatial pyramid pooling module to obtain the processed multi-source feature F_spp.
S2: As shown in FIG. 1, the feature maps F_3^rgb and F_3^ir output by the visible light convolution module C_3^rgb and the thermal infrared convolution module C_3^ir are channel-concatenated, and a semantic segmentation prediction result is then obtained through the segmentation convolution module.
S3: The backbone features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir from step 1 and the feature F_spp from the spatial pyramid pooling module are fed into the bottleneck part. To realize single-mode independent prediction, the bottleneck part is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which respectively extract three groups of multi-scale features for the fusion mode, the thermal infrared mode and the visible light mode. The fusion mode bottleneck takes F_3^rgb, F_3^ir, F_4^rgb, F_4^ir and F_spp as input, channel-concatenates F_3^rgb with F_3^ir and F_4^rgb with F_4^ir as multi-source features, and outputs the fused multi-scale features P_3^f, P_4^f and P_5^f; the thermal infrared mode bottleneck takes F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck takes F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v. To reduce the number of parameters and improve efficiency, the bottleneck structural parameters of the visible light and thermal infrared modes are shared.
S4: The detection module performs target detection on the three groups of multi-scale, multi-mode features (fusion mode features P_3^f, P_4^f, P_5^f, visible light mode features P_3^v, P_4^v, P_5^v, and thermal infrared mode features P_3^t, P_4^t, P_5^t), yielding multi-scale prediction results for the three modes. For the predictions of the fusion, thermal infrared and visible light modes, decision-level fusion is used to fuse the predictions at the same scale. Finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input image.
In this embodiment, the network executing steps S1-S4 is referred to as MSAS-YOLOv5. The execution of steps S1-S4 is described in further detail below in conjunction with the structure of MSAS-YOLOv5.
In this embodiment, in step S1, the visible light image I_rgb of size H_input × W_input × 3 and the thermal infrared image I_ir of size H_input × W_input × 3 are sent to the backbone network part for feature extraction. Each convolution module reduces the feature map size by 1/2, so after convolution module i a feature map of size H_input/2^i × W_input/2^i is obtained. The output feature sizes and channel numbers of C_3, C_4 and the spatial pyramid pooling module are H_input/8 × W_input/8 × 256, H_input/16 × W_input/16 × 512 and H_input/32 × W_input/32 × 1024, respectively. The features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir and the output feature F_spp of the spatial pyramid pooling module are fed into the bottleneck part and the segmentation part.
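For illustration, the following minimal PyTorch sketch shows how the stage-5 visible light and thermal infrared features could be channel-concatenated and passed through an SPPF-style spatial pyramid pooling block to obtain the multi-source feature F_spp. The class names, the SPPF variant and the channel widths of the fused block are assumptions of this sketch and do not limit the embodiment.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Basic convolution module: convolution -> batch normalization -> SiLU."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """SPPF-style spatial pyramid pooling (assumed variant, as in YOLOv5)."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_mid, 1)
        self.cv2 = ConvBNSiLU(c_mid * 4, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

# Stage-5 features of the two parameter-independent backbones (stride 32, 1024 channels each).
f5_rgb = torch.randn(1, 1024, 16, 20)   # 512x640 input -> 16x20 at stride 32
f5_ir = torch.randn(1, 1024, 16, 20)
spp = SPPF(c_in=2048, c_out=1024)
f_spp = spp(torch.cat([f5_rgb, f5_ir], dim=1))   # multi-source feature F_spp, 1x1024x16x20
```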
In this embodiment, the visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir are each formed by a basic convolution module and a cross-stage local convolution module connected in series. The cross-stage local convolution module splits the input feature map along the channel dimension into two feature maps F_p1 and F_p2; F_p1 passes through only one basic convolution module with a residual connection, F_p2 passes through n basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module. For convolution modules 1 to 5, n takes the values 1, 3, 6, 9 and 3, respectively.
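A hedged sketch of the cross-stage local convolution module described above follows, using the literal channel split into F_p1 and F_p2; the exact placement of the residual connection and the channel widths are assumptions of this sketch (YOLOv5's C3 block, for comparison, realizes the split with 1×1 convolutions rather than a literal chunk).

```python
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k=1, s=1):
    """Basic convolution module: convolution -> batch normalization -> SiLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class CrossStageLocalConv(nn.Module):
    """Channel-split the input into F_p1 / F_p2, process F_p1 with one basic
    convolution, F_p2 with n basic convolutions plus a residual connection
    (placement assumed), then concatenate and apply a final basic convolution."""
    def __init__(self, channels, n):
        super().__init__()
        c_half = channels // 2
        self.branch1 = conv_bn_silu(c_half, c_half, 1)
        self.branch2 = nn.Sequential(*[conv_bn_silu(c_half, c_half, 3) for _ in range(n)])
        self.fuse = conv_bn_silu(channels, channels, 1)

    def forward(self, x):
        f_p1, f_p2 = torch.chunk(x, 2, dim=1)        # split along the channel dimension
        y1 = self.branch1(f_p1)
        y2 = f_p2 + self.branch2(f_p2)               # residual connection
        return self.fuse(torch.cat([y1, y2], dim=1))

# n = 3 corresponds, e.g., to backbone stage 2 in this embodiment (n = 1, 3, 6, 9, 3 for stages 1-5).
block = CrossStageLocalConv(channels=256, n=3)
out = block(torch.randn(1, 256, 64, 80))
```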
In this embodiment, in step S1, the parameters of the visible light convolution module and the thermal infrared convolution module are independent.
In this embodiment, the segmentation convolution module in step S2 refers to a structure formed by connecting in series a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer. The first 3×3 convolution layer has 256 output channels. The final 3×3 convolution layer has 1 output channel, indicating the predicted probability that each position contains an object. A semantic segmentation prediction of size H_input/8 × W_input/8 is thus obtained.
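A minimal sketch of this segmentation convolution module follows; the input channel count (256 + 256 concatenated stride-8 features) and the use of logits rather than probabilities at the output are assumptions of this sketch.

```python
import torch
import torch.nn as nn

# Segmentation convolution module: 3x3 conv (256 ch) -> batch norm -> ReLU -> 3x3 conv (1 ch).
seg_head = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=3, padding=1),   # input: F_3^rgb and F_3^ir concatenated (assumed 256 + 256)
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, kernel_size=3, padding=1),     # one channel: per-position object probability (logit)
)

f3_rgb = torch.randn(1, 256, 64, 80)   # stride-8 visible light feature for a 512x640 input
f3_ir = torch.randn(1, 256, 64, 80)    # stride-8 thermal infrared feature
p_seg = seg_head(torch.cat([f3_rgb, f3_ir], dim=1))  # H/8 x W/8 segmentation prediction
```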
In this embodiment, the semantic segmentation prediction from step S2 is used for auxiliary supervision at training time. In the training phase of the network model, the target detection bounding box labels are used as a mask: positions containing objects in the original image are set to 1, and the mask is scaled to 1/8 of the original image size to form the semantic segmentation training label G_seg, which supervises the semantic segmentation prediction result P_seg obtained in step S2. The semantic segmentation auxiliary supervision is formed by the loss function L_seg below, where λ_seg is the weight of the semantic segmentation prediction loss and is set to 0.5, and L_bce is the binary cross-entropy loss:
L_seg = λ_seg · L_bce(P_seg, G_seg)
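The auxiliary supervision above can be sketched as follows; building the mask by filling ground-truth boxes, downscaling with nearest-neighbour interpolation, and applying the loss on logits are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def seg_aux_loss(p_seg, boxes, img_h, img_w, lambda_seg=0.5):
    """L_seg = lambda_seg * L_bce(P_seg, G_seg): positions inside ground-truth
    bounding boxes are set to 1 and the mask is downscaled to 1/8 resolution
    to form G_seg. boxes: iterable of (x1, y1, x2, y2) in input-image pixels."""
    mask = torch.zeros(1, 1, img_h, img_w)
    for x1, y1, x2, y2 in boxes:
        mask[..., int(y1):int(y2), int(x1):int(x2)] = 1.0
    g_seg = F.interpolate(mask, size=p_seg.shape[-2:], mode="nearest")
    return lambda_seg * F.binary_cross_entropy_with_logits(p_seg, g_seg)

# Example with the sketched segmentation output (H/8 x W/8 logits) for a 512x640 image:
p_seg = torch.randn(1, 1, 64, 80)
loss = seg_aux_loss(p_seg, boxes=[(100, 120, 180, 300)], img_h=512, img_w=640)
```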
In this embodiment, as shown in FIG. 1, the bottleneck part in step S3 is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck. Each of the three bottlenecks processes the multi-scale feature maps using a path aggregation network (PAN) structure that includes top-down and bottom-up paths.
In this example, as shown in FIG. 1, the fusion mode bottleneck structure takes the channel-concatenated backbone features and F_spp as input. The top-down path comprises PAN modules 1 to 3, and the bottom-up path comprises PAN modules 4 and 5. PAN module 1 processes F_spp with a basic convolution module; the processed feature is sent into PAN module 5, and is also upsampled to twice its size and sent into PAN module 2.
In this embodiment, PAN modules 2 and 3 consist of channel concatenation and a basic convolution module. As shown in FIG. 1, PAN modules 2 and 3 aggregate the PAN module features from the previous level with the backbone features of the corresponding scale. PAN modules 4 and 5 consist of a basic convolution module, channel concatenation and a cross-stage local convolution module with n = 3. The basic convolution modules of PAN modules 4 and 5 use a stride of 2 to downscale the feature from the previous PAN module, which is then concatenated with the PAN module feature of the same scale and finally passed through the cross-stage local convolution module to obtain the output feature. PAN modules 3, 4 and 5 output the multi-scale features P_3^f, P_4^f and P_5^f, whose sizes and channel numbers are H_input/8 × W_input/8 × 256, H_input/16 × W_input/16 × 512 and H_input/32 × W_input/32 × 1024, respectively.
In this embodiment, as shown in FIG. 1, the thermal infrared mode bottleneck structure takes the thermal infrared backbone features F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck structure takes the visible light backbone features F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v. The processing of the multi-scale features in the visible light and thermal infrared modes is similar to that of the fusion mode: the PAN module structure is the same as in the fusion mode for PAN modules 2, 3, 4 and 5, but with parameters independent of the corresponding fusion mode structures.
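The PAN bottleneck described above can be sketched as follows for a single modality; the number of convolutions per PAN module and the exact channel arithmetic are assumptions of this sketch (for the fusion mode bottleneck, the channel-concatenated RGB/IR inputs would double the input widths of PAN modules 1-3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_silu(c_in, c_out, k=1, s=1):
    """Basic convolution module: convolution -> batch normalization -> SiLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class PANNeck(nn.Module):
    """One modality bottleneck: top-down path (PAN modules 1-3) and bottom-up
    path (PAN modules 4-5). Output channels follow the embodiment
    (256/512/1024 at strides 8/16/32); each PAN module is simplified to a
    single basic convolution for this sketch."""
    def __init__(self):
        super().__init__()
        self.pan1 = conv_bn_silu(1024, 512, 1)        # PAN module 1
        self.pan2 = conv_bn_silu(512 + 512, 256, 1)   # PAN module 2: concat + conv
        self.pan3 = conv_bn_silu(256 + 256, 256, 1)   # PAN module 3
        self.down4 = conv_bn_silu(256, 256, 3, s=2)   # PAN module 4: stride-2 conv ...
        self.pan4 = conv_bn_silu(256 + 256, 512, 1)   # ... then concat + conv
        self.down5 = conv_bn_silu(512, 512, 3, s=2)   # PAN module 5
        self.pan5 = conv_bn_silu(512 + 512, 1024, 1)

    def forward(self, f3, f4, f_spp):
        # top-down path
        t5 = self.pan1(f_spp)
        t4 = self.pan2(torch.cat([F.interpolate(t5, scale_factor=2), f4], dim=1))
        p3 = self.pan3(torch.cat([F.interpolate(t4, scale_factor=2), f3], dim=1))
        # bottom-up path
        p4 = self.pan4(torch.cat([self.down4(p3), t4], dim=1))
        p5 = self.pan5(torch.cat([self.down5(p4), t5], dim=1))
        return p3, p4, p5   # strides 8, 16, 32

# Single-modality inputs (e.g. thermal infrared) for a 512x640 image:
f3 = torch.randn(1, 256, 64, 80)
f4 = torch.randn(1, 512, 32, 40)
f_spp = torch.randn(1, 1024, 16, 20)
p3, p4, p5 = PANNeck()(f3, f4, f_spp)
```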
In this embodiment, the basic convolution modules in the backbone network part and the bottleneck part are each composed of a convolution layer (1×1 or 3×3), a batch normalization layer and a Sigmoid-weighted linear unit (SiLU).
In this embodiment, the detection module in step S4 is composed of a 1×1 convolution layer. Target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent. To reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
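A minimal sketch of the detection module follows; the anchor count and class count are assumptions, while the per-scale 1×1 convolutions with parameters shared across modalities follow the description above.

```python
import torch
import torch.nn as nn

n_anchors, n_classes = 3, 1                    # assumed values (e.g. a pedestrian detector)
out_ch = n_anchors * (5 + n_classes)           # box (4) + confidence (1) + classes per anchor
detect = nn.ModuleList([
    nn.Conv2d(256, out_ch, kernel_size=1),     # stride-8 head
    nn.Conv2d(512, out_ch, kernel_size=1),     # stride-16 head
    nn.Conv2d(1024, out_ch, kernel_size=1),    # stride-32 head
])

def predict(multi_scale_feats):
    """Apply the per-scale heads to one modality's P_3/P_4/P_5 features;
    the same heads are reused for the fusion, visible light and IR branches."""
    return [head(f) for head, f in zip(detect, multi_scale_feats)]

feats_fuse = [torch.randn(1, 256, 64, 80), torch.randn(1, 512, 32, 40), torch.randn(1, 1024, 16, 20)]
preds_fuse = predict(feats_fuse)
```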
In this embodiment, the decision-level fusion of the three modes' detection results in step S4 refers to taking a weighted average of the same-scale detection results P_F, P_R and P_I of the fusion, visible light and thermal infrared modes to obtain the final detection result P, as shown in the following formula, where the weight coefficients λ_F, λ_R and λ_I of the three modes are 0.5, 0.25 and 0.25, respectively:

P = λ_F · P_F + λ_R · P_R + λ_I · P_I
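The decision-level fusion can be sketched as a per-scale weighted sum; whether the averaging is applied to raw logits or to decoded predictions is an assumption of this sketch.

```python
import torch

def decision_level_fusion(p_fuse, p_rgb, p_ir, w_f=0.5, w_r=0.25, w_i=0.25):
    """Weighted average of same-scale detection outputs from the fusion, visible
    light and thermal infrared branches (weights 0.5 / 0.25 / 0.25 as stated above)."""
    return [w_f * f + w_r * r + w_i * i for f, r, i in zip(p_fuse, p_rgb, p_ir)]

# Toy example with one scale of shape (batch, anchors*(5+classes), H, W):
p_f = [torch.randn(1, 18, 64, 80)]
p_r = [torch.randn(1, 18, 64, 80)]
p_i = [torch.randn(1, 18, 64, 80)]
fused = decision_level_fusion(p_f, p_r, p_i)
```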
In this embodiment, the detection results of the three modes in step S4 are each supervised during training by matching the loss function with the corresponding training labels. The labels shared by the visible light and thermal infrared modes are taken as ground truth to supervise the fusion mode detection result. To realize single-mode detection auxiliary supervision, labels unique to the visible light mode supervise the visible light mode detection result, and labels unique to the thermal infrared mode supervise the thermal infrared detection result. The total detection loss of this step is L_det, where L_cls, L_bbox and L_obj are the classification loss, bounding box loss and confidence loss of the fusion mode prediction; p_f is the fusion mode prediction result, obtained by merging the multi-scale fusion mode predictions, and gt_f is the modality-shared training label. L_obj_v and L_bbox_v are the confidence loss and bounding box loss of the visible light mode prediction task, and L_obj_t and L_bbox_t are the confidence loss and bounding box loss of the thermal infrared mode prediction task. p_v is the visible light single-mode prediction result, obtained by merging the multi-scale visible light mode predictions; p_t is the thermal infrared single-mode prediction result, obtained by merging the multi-scale thermal infrared mode predictions. gt_v and gt_t are the visible light labels and the thermal infrared labels, respectively. Independent labeling of the visible light and thermal infrared modes provides mode-specific supervisory information for the two modes.
L_cls, L_obj, L_obj_v and L_obj_t are cross-entropy losses, and L_bbox, L_bbox_v and L_bbox_t are CIoU losses. λ_cls, λ_obj and λ_bbox are the weights of the classification loss, confidence loss and bounding box loss, set to 0.5, 0.5 and 0.05, respectively:
L_det = λ_cls·L_cls(p_f, gt_f) + λ_obj·L_obj(p_f, gt_f) + λ_bbox·L_bbox(p_f, gt_f)
      + λ_obj·L_obj_v(p_v, gt_v) + λ_bbox·L_bbox_v(p_v, gt_v)
      + λ_obj·L_obj_t(p_t, gt_t) + λ_bbox·L_bbox_t(p_t, gt_t)
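The composition of L_det above can be sketched as follows, assuming the per-term losses (classification, confidence, bounding box) have already been computed for each branch by a YOLOv5-style matching procedure.

```python
import torch

def detection_loss(losses_fuse, losses_rgb, losses_ir,
                   lambda_cls=0.5, lambda_obj=0.5, lambda_bbox=0.05):
    """Sketch of L_det: the fusion branch gets classification + confidence +
    bounding-box terms from the shared labels, while the visible light and
    thermal infrared branches get confidence and bounding-box terms from their
    modality-specific labels. Each argument is a dict of scalar loss terms."""
    l = (lambda_cls * losses_fuse["cls"] + lambda_obj * losses_fuse["obj"]
         + lambda_bbox * losses_fuse["bbox"])
    l = l + lambda_obj * losses_rgb["obj"] + lambda_bbox * losses_rgb["bbox"]
    l = l + lambda_obj * losses_ir["obj"] + lambda_bbox * losses_ir["bbox"]
    return l

# Example with placeholder scalar losses:
dummy_f = {"cls": torch.tensor(0.8), "obj": torch.tensor(1.2), "bbox": torch.tensor(0.6)}
dummy_v = {"obj": torch.tensor(1.0), "bbox": torch.tensor(0.5)}
dummy_t = {"obj": torch.tensor(0.9), "bbox": torch.tensor(0.4)}
l_det = detection_loss(dummy_f, dummy_v, dummy_t)
```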
In this embodiment, the training is performed in a deep supervision manner after step S4, and the total loss function is as follows:
L_total = L_seg + L_det
where L_seg and L_det are the semantic segmentation loss of step S2 and the target detection loss of step S4, respectively.
To verify the effectiveness of MSAS-YOLOv5, this embodiment uses the public KAIST dataset for training and testing of the network framework and compares it with other methods. The KAIST dataset contains 7,595 visible-infrared image pairs for training and 2,252 pairs for testing. The image size is 512 × 640 pixels. The data can further be divided into all-day, daytime and nighttime sub-datasets, on which the performance of the multi-source target detection algorithm is tested respectively.
The algorithm proposed in this example is compared with five recent multi-source target detection methods: CIAN (Cross-modality Interactive Attention Network), MSDS-RCNN (Multispectral Simultaneous Detection and Segmentation R-CNN), AR-CNN (Aligned Region CNN), MBNet (Modality Balance Network) and UGCML (Uncertainty-Guided Cross-Modal Learning); the specific results are shown in Table 1. Two evaluation indexes are used. Accuracy is evaluated with the log-average miss rate (MR^-2); the smaller this value, the fewer targets the detector misses and the higher its accuracy. Speed is measured in frames per second (FPS). As Table 1 shows, the method of this embodiment (MSAS-YOLOv5) reaches the lowest log-average miss rate of 5.89% on the all-day sub-dataset, indicating the highest detection accuracy, and reaches 24.39 FPS, exceeding the other multi-source target detection methods in the table. Compared with the second best method (MBNet), MSAS-YOLOv5 reduces the log-average miss rate by 2.27% on the all-day sub-dataset, 1.01% on the daytime sub-dataset, and 4.24% on the nighttime sub-dataset. FIG. 2 shows three groups of multi-source target detection results of the method of this embodiment compared with mainstream existing methods. As seen in the first row of FIG. 2, the detection result of this embodiment is more complete and closer to the ground truth. The input images in the second row contain many small pedestrian targets; the method of this embodiment detects the small pedestrians accurately, while the other comparison methods miss some of them. For the night scene in the third row, the detection result of this embodiment is still closer to the real result than the other methods.
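For clarity on the accuracy metric used here, the following is a hedged sketch of the log-average miss rate (MR^-2) computation: the miss rate is sampled at nine FPPI reference points evenly spaced in log space over [10^-2, 10^0] and the geometric mean of the samples is taken; the interpolation scheme shown is an assumption of this sketch.

```python
import numpy as np

def log_average_miss_rate(miss_rates, fppi):
    """MR^-2 sketch: sample the MR-FPPI curve at 9 log-spaced FPPI reference
    points in [1e-2, 1e0] and return the geometric mean of the samples.
    miss_rates and fppi must be sorted by increasing FPPI."""
    ref_points = np.logspace(-2.0, 0.0, num=9)
    samples = []
    for ref in ref_points:
        # take the miss rate at the largest FPPI not exceeding the reference point
        idx = np.where(fppi <= ref)[0]
        samples.append(miss_rates[idx[-1]] if len(idx) else 1.0)
    samples = np.maximum(np.array(samples), 1e-10)   # avoid log(0)
    return np.exp(np.mean(np.log(samples)))

# Example with a toy MR-FPPI curve (miss rate decreases as FPPI grows):
fppi = np.array([0.005, 0.01, 0.05, 0.1, 0.5, 1.0])
mr = np.array([0.60, 0.40, 0.20, 0.12, 0.08, 0.06])
print(log_average_miss_rate(mr, fppi))
```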
Table 1 is a comparative table of test results for the methods of the examples of the present invention and other prior art methods
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.
Claims (8)
1. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 is characterized by comprising four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part; the backbone network part extracts convolution feature maps from the visible light and thermal infrared images respectively; the bottleneck part combines the feature maps of the visible light and thermal infrared modes along top-down and bottom-up paths to obtain features at each level for the fusion mode, the visible light mode and the thermal infrared mode; the segmentation part predicts semantic segmentation from the visible light and thermal infrared mode feature maps through convolution layers and uses it as an auxiliary supervision task; the detection part performs target detection on the features of each level of each mode and applies decision-level fusion to the prediction results of the fusion, visible light and thermal infrared modes; the predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode prediction auxiliary information;
The method specifically comprises the following steps:
Step 1: the visible light and thermal infrared images I_rgb and I_ir are input into the backbone network part for feature extraction; I_rgb and I_ir sequentially pass through visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir, where i ∈ {1,2,3,4,5}, yielding feature pairs F_i^rgb and F_i^ir, where F_i^rgb is the feature extracted from the visible light image I_rgb by visible light convolution module C_i^rgb and F_i^ir is the feature extracted from the thermal infrared image I_ir by thermal infrared convolution module C_i^ir; after C_5^rgb and C_5^ir, the features F_5^rgb and F_5^ir are channel-concatenated and fed into a spatial pyramid pooling module to obtain the processed multi-source feature F_spp;
Step 2: the feature maps F_3^rgb and F_3^ir output by the visible light convolution module C_3^rgb and the thermal infrared convolution module C_3^ir in step 1 are channel-concatenated, and a semantic segmentation prediction result is then obtained through a segmentation convolution module;
Step 3: the backbone features F_3^rgb, F_3^ir, F_4^rgb and F_4^ir from step 1 and the feature F_spp from the spatial pyramid pooling module are fed into the bottleneck part; to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which respectively extract three groups of multi-scale features for the fusion mode, the thermal infrared mode and the visible light mode; the fusion mode bottleneck takes F_3^rgb, F_3^ir, F_4^rgb, F_4^ir and F_spp as input, channel-concatenates F_3^rgb with F_3^ir and F_4^rgb with F_4^ir as multi-source features, and outputs the fused multi-scale features P_3^f, P_4^f and P_5^f; the thermal infrared mode bottleneck takes F_3^ir, F_4^ir and F_spp as input and outputs the thermal infrared multi-scale features P_3^t, P_4^t and P_5^t; the visible light mode bottleneck takes F_3^rgb, F_4^rgb and F_spp as input and outputs the visible light multi-scale features P_3^v, P_4^v and P_5^v;
Step 4: a detection module performs target detection on the three groups of multi-scale features obtained from the three mode bottlenecks in step 3 to obtain object predictions for each mode at each scale; for the predictions of the fusion, thermal infrared and visible light modes, decision-level fusion is used to fuse the predictions at the same scale; finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input multi-source image.
2. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the visible light convolution modules C_i^rgb and thermal infrared convolution modules C_i^ir in step 1 are each formed by a basic convolution module and a cross-stage local convolution module connected in series; the cross-stage local convolution module splits the input feature map along the channel dimension into two feature maps F_p1 and F_p2; F_p1 passes through only one basic convolution module with a residual connection, F_p2 passes through several basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module.
3. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the parameters of the visible light convolution modules C_i^rgb and the thermal infrared convolution modules C_i^ir are independent.
4. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the segmentation convolution module in step 2 refers to a structure formed by connecting in series a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer.
5. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the three mode bottleneck structures in step 3 each process the multi-scale feature maps using a path aggregation network (PAN) structure that includes top-down and bottom-up paths.
6. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the basic convolution modules in the backbone network part and the bottleneck part are composed of a convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit.
7. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the detection module in step 4 is composed of a 1×1 convolution layer; target detection is performed on the three levels of feature maps with strides 8, 16 and 32, respectively, and the detection module parameters of the three levels are independent; to reduce the number of learned parameters, the fusion, visible light and thermal infrared mode detection modules share parameters.
8. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: in step 4, the decision-level fusion of the three modes' detection results refers to a weighted average of the detection results of the fusion, visible light and thermal infrared modes, with weight coefficients of 0.5, 0.25 and 0.25, respectively.
Priority Application (1)
- CN202310211247.7A, filed 2023-03-07 (priority date 2023-03-07): RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5

Publications (2)
- CN116665036A, published 2023-08-29
- CN116665036B, granted 2024-09-17

Family
- Family ID: 87724909
- 2023-03-07: application CN202310211247.7A filed in CN; granted as patent CN116665036B (status: Active)

Families Citing this family (1)
- CN117975040B, granted 2024-06-18: GIS infrared image recognition system and method based on improved YOLOv5
Citations (2)
- CN111767882A, published 2020-10-13: Multi-mode pedestrian detection method based on an improved YOLO model
- CN113361466A, published 2021-09-07: Multi-spectral target detection method based on multi-modal cross-guided learning

Family Cites Families (5)
- CN111209810B, granted 2023-05-26: Bounding-box-segmentation-supervised deep neural network architecture for accurate real-time pedestrian detection in visible light and infrared images
- CN112529878B, granted 2024-04-02: Multi-view semi-supervised lymph node classification method, system and equipment
- CN113627504B, granted 2022-06-14: Multi-mode multi-scale feature fusion target detection method based on generative adversarial networks
- CN115331162A, published 2022-11-11: Cross-scale infrared pedestrian detection method, system, medium, equipment and terminal
- CN115713679A, published 2023-02-24: Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth maps
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant