
CN116665036B - RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 - Google Patents

RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5

Info

Publication number
CN116665036B
CN116665036B
Authority
CN
China
Prior art keywords
mode
visible light
thermal infrared
target detection
rgb
Prior art date
Legal status
Active
Application number
CN202310211247.7A
Other languages
Chinese (zh)
Other versions
CN116665036A (en)
Inventor
张秀伟
张艳宁
倪涵
汪进中
王文娜
邢颖慧
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202310211247.7A priority Critical patent/CN116665036B/en
Publication of CN116665036A publication Critical patent/CN116665036A/en
Application granted granted Critical
Publication of CN116665036B publication Critical patent/CN116665036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Radiation Pyrometers (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, and designs a single-mode auxiliary supervised RGB-infrared multi-source target detection model, MSAS-YOLOv5. The model uses YOLOv5, which balances speed and accuracy, as the base method and efficiently extracts multi-level features of RGB visible light and thermal infrared images. A single-mode auxiliary supervision method is designed, combining a semantic segmentation auxiliary task with single-mode detection auxiliary tasks. Independent target detection prediction branches are set up for the visible light and thermal infrared modes, and independent visible light and thermal infrared mode labels are used to supervise the predictions of the two modes respectively. In terms of accuracy, MSAS-YOLOv5 achieves an average logarithmic miss rate of 5.89% on the KAIST dataset, indicating fewer missed detections. Its speed reaches 24.39 FPS, faster than mainstream multi-source target detection methods.

Description

RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5.
Background
Object detection is an important area of computer vision. Image-based object detection identifies the categories and positions of objects present in an image through image processing and computer vision techniques. Collaborative target detection with multi-source RGB-infrared images can alleviate the poor recognition performance of RGB-only target detection under bad weather and poor illumination. RGB-infrared multi-source image target detection has important applications in fields such as autonomous driving, video surveillance and security.
In recent years, deep learning techniques typified by convolutional neural networks have become the mainstream approach to target detection. Deep learning-based object detection methods can be broadly divided into single-stage detection, represented by the YOLO series, and two-stage detection, represented by the Faster R-CNN series. Deep learning-based RGB-infrared multi-source target detection has also been widely studied and outperforms multi-source target detection based on traditional methods. For example, Zhang et al. ("Weakly aligned cross-modal learning for multispectral pedestrian detection", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 5127-5137) designed a multi-source target detection framework based on Faster R-CNN and, because of misalignment in multi-source target detection datasets (i.e., objects have different positions in the visible and thermal infrared images, which affects the supervision of a multi-source detection network), provided independent visible light and thermal infrared annotations for the dataset. Kim et al. ("Uncertainty-guided cross-modal learning for robust multispectral pedestrian detection", IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(3): 1510-1523) proposed a Faster R-CNN based uncertainty-aware multi-source pedestrian detection framework that learns more discriminative multi-source modality features.
However, current deep learning RGB-infrared multi-source target detection methods still have problems: 1. most multi-source target detection methods are based on the two-stage Faster R-CNN detector, so their inference speed is low; 2. existing work rarely explicitly exploits the independent visible light and thermal infrared annotations to provide additional auxiliary information for detection, even though such modality-specific labels can provide more accurate feature-learning supervision for the visible light and thermal infrared backbone networks. It is therefore necessary to design a single-mode auxiliary supervised multi-source YOLOv5 target detection method that is built on a fast single-stage detector and supervised with combined auxiliary information.
Disclosure of Invention
Technical problem to be solved
Aiming at the problems that existing multi-source target detection methods are slow and do not make full use of auxiliary information to strengthen detection supervision, the invention provides an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5.
Technical proposal
An RGB-infrared multi-source image target detection model based on single-mode auxiliary supervision and YOLOv5, characterized by comprising four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part. The backbone network part extracts convolution feature maps of the visible light and thermal infrared images respectively. The bottleneck part combines the visible light and thermal infrared feature maps top-down and bottom-up to obtain multi-level features for the fusion mode, the visible light mode and the thermal infrared mode. The segmentation part predicts semantic segmentation from the visible light and thermal infrared feature maps through convolution layers and serves as an auxiliary supervision task. The detection part performs target detection on the multi-level features of the fusion, visible light and thermal infrared modes and applies decision-level fusion to the prediction results of the three modes. The predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode auxiliary prediction information.
An RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, characterized by comprising the following steps:
Step 1: the visible light image I_rgb and the thermal infrared image I_ir are input into the backbone network part for feature extraction; they sequentially pass through visible light convolution modules C_rgb^i and thermal infrared convolution modules C_ir^i, where i ∈ {1,2,3,4,5}, to obtain feature pairs F_rgb^i and F_ir^i, where F_rgb^i denotes the features extracted from the visible light image I_rgb by C_rgb^i and F_ir^i denotes the features extracted from the thermal infrared image I_ir by C_ir^i; after C_rgb^5 and C_ir^5, the features F_rgb^5 and F_ir^5 are channel-concatenated and fed into a spatial pyramid pooling module to obtain the processed multi-source feature F_sp;
Step 2: convoluting the visible light in the step 1And thermal infrared convolution moduleOutput characteristic diagramAndPerforming channel splicing, and then obtaining a semantic segmentation prediction result through a segmentation convolution module;
Step 3: characterization of backbone network part in step 1 Features from a spatial pyramid pooling moduleFeeding into the bottleneck section; in order to realize single-mode independent prediction, a bottleneck part is provided with three groups of structures of a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, and three-to-multi-scale characteristics of the fusion mode, the thermal infrared mode and the visible light mode are respectively extracted; wherein the modality bottleneck structure is fused AndFor input, willAndChannel stitching is carried out as multi-source feature processing, and multi-scale features are outputThermal infrared mode bottleneck structureAndOutputting thermal infrared multi-scale features as inputsVisible light mode bottleneck structureAndFor inputting and outputting visible light multi-scale features
Step 4: performing target detection on the three-to-multi-scale features obtained in the bottleneck part of the three modes in the step 3 by using a detection module to obtain object prediction under each mode and each scale; for prediction of three modes of fusion, thermal infrared and visible light, fusion of prediction under the same scale is carried out by using decision-level fusion; and finally, carrying out non-maximum suppression post-processing on the three-scale detection results obtained through fusion to obtain target detection prediction output of the input multi-source image.
A further technical solution of the invention is as follows: the visible light convolution modules C_rgb^i and the thermal infrared convolution modules C_ir^i in step 1 are each formed by a basic convolution module and a cross-stage local convolution module connected in series; the cross-stage local convolution module splits the input feature map into two feature maps F_p1 and F_p2 by channel, F_p1 passes through only one basic convolution module as a residual connection, F_p2 passes through several basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module.
A further technical solution of the invention is as follows: the parameters of the visible light convolution modules C_rgb^i and the thermal infrared convolution modules C_ir^i are independent.
A further technical solution of the invention is as follows: the segmentation convolution module in step 2 is a structure formed by a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer connected in series.
A further technical solution of the invention is as follows: the three mode bottlenecks in step 3 each process the multi-scale feature maps using a path aggregation network (PAN) structure that contains top-down and bottom-up paths.
A further technical solution of the invention is as follows: the basic convolution modules in the backbone network part and the bottleneck part consist of a convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit (SiLU).
A further technical solution of the invention is as follows: the detection module in step 4 consists of a 1×1 convolution layer; target detection is performed on the three levels of feature maps with strides 8, 16 and 32, and the detection module parameters of the three levels are independent; to reduce the number of learned parameters, the detection modules of the fusion, visible light and thermal infrared modes share parameters.
A further technical solution of the invention is as follows: the decision-level fusion of the three modes' detection results in step 4 refers to a weighted average of the fusion, visible light and thermal infrared mode detection results, with weight coefficients of 0.5, 0.25 and 0.25 respectively.
Advantageous effects
The invention provides an RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, and designs a single-mode auxiliary supervised RGB-infrared multi-source target detection model, MSAS-YOLOv5. The model uses YOLOv5, which balances speed and accuracy, as the base method and efficiently extracts multi-level features of RGB visible light and thermal infrared images. A single-mode auxiliary supervision method is designed, combining a semantic segmentation auxiliary task with single-mode detection auxiliary tasks. Independent target detection prediction branches are set up for the visible light and thermal infrared modes, and independent visible light and thermal infrared mode labels are used to supervise the predictions of the two modes respectively. In terms of accuracy, MSAS-YOLOv5 achieves an average logarithmic miss rate of 5.89% on the KAIST dataset, indicating fewer missed detections. Its speed reaches 24.39 FPS, faster than mainstream multi-source target detection methods.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a network configuration diagram of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing comparison of target detection results of the method according to the embodiment of the present invention and other prior art methods.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
An RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5: using the single-stage target detection method YOLOv5, which balances speed and accuracy, and introducing single-mode auxiliary supervision, a new single-mode auxiliary supervised multi-source YOLOv5 target detection model is built for RGB-infrared multi-source image target detection. The detection model comprises four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part. The backbone network part extracts convolution feature maps of the visible light and thermal infrared images respectively. The bottleneck part combines the visible light and thermal infrared feature maps top-down and bottom-up to obtain multi-level features for the fusion, visible light and thermal infrared modes. The segmentation part predicts semantic segmentation from the visible light and thermal infrared feature maps through convolution layers and serves as an auxiliary supervision task. The detection part performs target detection on the multi-level features of the three modes and applies decision-level fusion to their prediction results. The predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode auxiliary prediction information.
The method specifically comprises the following steps:
Step 1: the visible and thermal infrared images I rgb and I ir are input into the backbone network part for feature extraction. The visible light and thermal infrared images I rgb and I ir sequentially pass through a visible light convolution module And thermal infrared convolution module(Where i.epsilon. {1,2,3,4,5 }) to obtain a plurality of feature pairsAndWherein the method comprises the steps ofVisible light convolution module for visible light image I rgb The features that are extracted later are used to determine,Thermal infrared convolution module for thermal infrared image I ir Post-extracted features. At the position ofAndAfter that, the characteristics areAndChannel splicing is carried out, and the channel splicing is sent into a space pyramid pooling module to obtain the processed multi-source characteristics
Step 2: convoluting the visible light in the step 1And thermal infrared convolution moduleOutput characteristic diagramAndPerforming channel splicing, and then obtaining a semantic segmentation prediction result through a segmentation convolution module;
Step 3: characterization of backbone network part in step 1 Features from a spatial pyramid pooling moduleInto the bottleneck section. In order to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures of a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, and three-to-multi-scale characteristics of the fusion mode, the thermal infrared mode and the visible light mode are respectively extracted. Wherein the modality bottleneck structure is fused AndFor input, willAndChannel stitching is carried out as multi-source feature processing, and multi-scale features are outputThermal infrared mode bottleneck structureAndOutputting thermal infrared multi-scale features as inputsVisible light mode bottleneck structureAndFor inputting and outputting visible light multi-scale featuresIn order to reduce the parameter quantity and improve the efficiency, the bottleneck part structural parameters of the visible light and the thermal infrared modes are shared;
step 4: and (3) performing target detection on the three-to-multi-scale features obtained in the bottleneck part of the three modes in the step (3) by using a detection module to obtain object prediction under each mode and each scale. For prediction of three modes of fusion, thermal infrared and visible light, decision-level fusion is used for fusion of predictions under the same scale. And finally, carrying out non-maximum suppression post-processing on the three-scale detection results obtained through fusion to obtain target detection prediction output of the input multi-source image.
Optionally, the visible light convolution modules C_rgb^i and the thermal infrared convolution modules C_ir^i in step 1 are each formed by a basic convolution module and a cross-stage local convolution module connected in series. The cross-stage local convolution module splits the input feature map into two feature maps F_p1 and F_p2 by channel: F_p1 passes through only one basic convolution module as a residual connection, F_p2 passes through several basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module.
Optionally, the parameters of the visible light convolution modules C_rgb^i and the thermal infrared convolution modules C_ir^i in step 1 are independent.
Optionally, the segmentation convolution module in step 2 is a structure formed by a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer connected in series.
Optionally, the semantic segmentation prediction from step 2 is used for auxiliary supervision during training: in the training stage of the network model, the target detection bounding box labels are used as a mask to supervise the semantic segmentation prediction result obtained in step 2.
Optionally, the three mode bottlenecks in step 3 each process the multi-scale feature maps using a path aggregation network (PAN, Path Aggregation Network) structure that contains top-down and bottom-up paths.
Optionally, the basic convolution modules in the backbone network part and the bottleneck part consist of a convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit (SiLU).
Optionally, the detection module in step 4 consists of a 1×1 convolution layer. Target detection is performed on the three levels of feature maps with strides 8, 16 and 32, and the detection module parameters of the three levels are independent. To reduce the number of learned parameters, the detection modules of the fusion, visible light and thermal infrared modes share parameters.
Optionally, the decision-level fusion of the three modes' detection results in step 4 refers to a weighted average of the fusion, visible light and thermal infrared mode detection results, with weight coefficients of 0.5, 0.25 and 0.25 respectively.
Optionally, the detection results of the three modes in step 4 are each trained in a supervised manner with a loss function and matched training labels: the labels shared by the visible light and thermal infrared modes are taken as ground truth for supervising the fusion mode detection results; to realize single-mode detection auxiliary supervision, the labels specific to the visible light mode supervise the visible light mode detection results, and the labels specific to the thermal infrared mode supervise the thermal infrared mode detection results.
Specific examples:
As shown in FIG. 1, aiming at the problems that existing multi-source target detection methods are not fast enough and make little comprehensive use of effective supervision information, the invention designs an RGB-infrared multi-source image target detection model based on single-mode auxiliary supervision and YOLOv5. It comprises a backbone network part, a bottleneck part, a segmentation part and a prediction part. The specific method comprises the following steps:
S1, the visible light image I_rgb and the thermal infrared image I_ir are input into the backbone network part for feature extraction, obtaining feature pairs F_rgb^i and F_ir^i, where F_rgb^i denotes the features extracted from the visible light image I_rgb by the visible light convolution module C_rgb^i and F_ir^i denotes the features extracted from the thermal infrared image I_ir by the thermal infrared convolution module C_ir^i. After C_rgb^5 and C_ir^5, the features F_rgb^5 and F_ir^5 are channel-concatenated and fed into the spatial pyramid pooling module to obtain the processed multi-source feature F_sp.
S2, as shown in FIG. 1, the feature maps F_rgb^3 and F_ir^3 output by the visible light convolution module C_rgb^3 and the thermal infrared convolution module C_ir^3 are channel-concatenated and then passed through a segmentation convolution module to obtain a semantic segmentation prediction result.
S3, the backbone features F_rgb^3, F_rgb^4, F_ir^3 and F_ir^4 from step S1 and the feature F_sp from the spatial pyramid pooling module are fed into the bottleneck part. To realize single-mode independent prediction, the bottleneck part is provided with three groups of structures, a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which extract three sets of multi-scale features for the fusion, thermal infrared and visible light modes respectively. The fusion mode bottleneck takes F_rgb^3, F_rgb^4, F_ir^3, F_ir^4 and F_sp as input, channel-concatenates the visible light and thermal infrared features of the same level for multi-source feature processing, and outputs multi-scale features P_f^3, P_f^4 and P_f^5; the thermal infrared mode bottleneck takes F_ir^3, F_ir^4 and F_sp as input and outputs thermal infrared multi-scale features P_t^3, P_t^4 and P_t^5; the visible light mode bottleneck takes F_rgb^3, F_rgb^4 and F_sp as input and outputs visible light multi-scale features P_v^3, P_v^4 and P_v^5. To reduce the number of parameters and improve efficiency, the structural parameters of the visible light and thermal infrared mode bottlenecks are shared.
S4, the detection module performs target detection on the three sets of multi-scale multi-mode features (fusion mode features P_f^3, P_f^4, P_f^5; visible light mode features P_v^3, P_v^4, P_v^5; thermal infrared mode features P_t^3, P_t^4, P_t^5), obtaining the multi-scale prediction results of the three modes. The predictions of the fusion, thermal infrared and visible light modes at the same scale are combined by decision-level fusion. Finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input image.
In this embodiment, the network executing steps S1-S4 is referred to as MSAS-YOLOv5. The execution of steps S1-S4 will be described in further detail below in conjunction with the structure of MSAS-YOLOv5.
In this embodiment, in step S1, the visible light image I_rgb of size H_input × W_input × 3 and the thermal infrared image I_ir of size H_input × W_input × 3 are sent to the backbone network part for feature extraction. Each convolution module C^i halves the size of the feature map, so after convolution module i the feature map size is H_input/2^i × W_input/2^i. The level-3 and level-4 features and the output of the spatial pyramid pooling module have sizes and channel numbers of H_input/8 × W_input/8 × 256, H_input/16 × W_input/16 × 512 and H_input/32 × W_input/32 × 1024 respectively. F_rgb^3, F_rgb^4, F_ir^3, F_ir^4 and the output feature F_sp of the spatial pyramid pooling module are fed into the bottleneck part and the segmentation part.
In this embodiment, the visible light convolution modules C_rgb^i and the thermal infrared convolution modules C_ir^i are each formed by a basic convolution module and a cross-stage local convolution module connected in series. The cross-stage local convolution module splits the input feature map into two feature maps F_p1 and F_p2 by channel: F_p1 passes through only one basic convolution module as a residual connection, F_p2 passes through n basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module. For C^1 to C^5, n takes the values 1, 3, 6, 9 and 3 respectively.
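A minimal sketch (PyTorch) of the basic convolution module and the cross-stage local convolution module described above is given below; the exact kernel sizes and the 1×1 shortcut/final convolutions are assumptions for illustration, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class BaseConv(nn.Module):
    """Basic convolution module: convolution -> batch normalization -> SiLU (layout assumed)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CrossStageLocalConv(nn.Module):
    """Split the input by channel: F_p1 takes a one-module shortcut path, F_p2 passes
    through n basic convolution modules; the two branches are concatenated and
    passed through a final basic convolution module."""
    def __init__(self, channels, n):
        super().__init__()
        half = channels // 2
        self.short_path = BaseConv(half, half, k=1)                       # F_p1 branch
        self.main_path = nn.Sequential(*[BaseConv(half, half, k=3) for _ in range(n)])  # F_p2 branch
        self.final = BaseConv(channels, channels, k=1)

    def forward(self, x):
        f_p1, f_p2 = x.chunk(2, dim=1)  # channel split
        out = torch.cat([self.short_path(f_p1), self.main_path(f_p2)], dim=1)
        return self.final(out)
```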
In this embodiment, in step S1, the parameters of the visible light convolution module and the thermal infrared convolution module are independent.
In this embodiment, the segmentation convolution module in step S2 is a structure formed by a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer connected in series. The first 3×3 convolution layer has 256 output channels; the final 3×3 convolution layer has 1 output channel, indicating the predicted probability that each position contains an object. A semantic segmentation prediction of size H_input/8 × W_input/8 is finally obtained.
In this embodiment, the semantic segmentation prediction from step S2 is used for auxiliary supervision during training. In the training phase of the network model, the target detection bounding box labels are used as a mask: positions covered by objects in the original image are set to 1, and the mask is scaled to 1/8 of the original image size as the semantic segmentation training label G_seg, which supervises the semantic segmentation prediction result P_seg obtained in step S2. The semantic segmentation auxiliary supervision is formed with the loss function L_seg, where λ_seg is the weight of the semantic segmentation prediction loss, set to 0.5, and L_bce is a binary cross-entropy loss:
L_seg = λ_seg · L_bce(P_seg, G_seg)
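The following is a minimal sketch (PyTorch) of the segmentation convolution module and the auxiliary loss L_seg defined above; input channel counts, tensor shapes and helper names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# segmentation head: 3x3 conv (256 ch) -> batch norm -> ReLU -> 3x3 conv (1 ch, logits)
seg_head = nn.Sequential(
    nn.Conv2d(512, 256, 3, padding=1),   # assumed input: concatenated stride-8 RGB/IR features
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, 3, padding=1),
)

def make_seg_label(boxes, h_in, w_in):
    """Rasterise ground-truth boxes (pixel coordinates) into a binary mask at 1/8 resolution."""
    mask = torch.zeros(1, 1, h_in, w_in)
    for x1, y1, x2, y2 in boxes:
        mask[..., int(y1):int(y2), int(x1):int(x2)] = 1.0
    return F.interpolate(mask, size=(h_in // 8, w_in // 8), mode="nearest")

def seg_loss(p_seg, g_seg, lambda_seg=0.5):
    # L_seg = lambda_seg * BCE(P_seg, G_seg); p_seg are logits with the same shape as g_seg
    return lambda_seg * F.binary_cross_entropy_with_logits(p_seg, g_seg)
```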
In this embodiment, as shown in fig. 1, the bottleneck part in step S3 is provided with three groups of structures: a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck. Each of the three bottlenecks processes the multi-scale feature maps using a path aggregation network (PAN, Path Aggregation Network) structure that contains top-down and bottom-up paths.
In this embodiment, as shown in FIG. 1, the fusion mode bottleneck takes F_rgb^3, F_rgb^4, F_ir^3, F_ir^4 and F_sp as input. The top-down path comprises PAN modules 1 to 3, and the bottom-up path comprises PAN modules 4 and 5. PAN module 1 processes F_sp with a basic convolution module; the processed feature is sent to PAN module 5, and is also upsampled to twice its size and sent to PAN module 2.
In this embodiment, PAN modules 2 and 3 comprise channel concatenation and a basic convolution module. As shown in fig. 1, PAN modules 2 and 3 aggregate the feature from the previous PAN module and the backbone feature of the corresponding scale. PAN modules 4 and 5 comprise a basic convolution module, channel concatenation and a cross-stage local convolution module with n = 3; the basic convolution module of PAN modules 4 and 5 has stride 2 and downscales the feature from the previous PAN module, which is then concatenated with the PAN module feature of the same scale and passed through the cross-stage local convolution module to obtain the output feature. PAN modules 3, 4 and 5 output the multi-scale features P_f^3, P_f^4 and P_f^5, with sizes and channel numbers of H_input/8 × W_input/8 × 256, H_input/16 × W_input/16 × 512 and H_input/32 × W_input/32 × 1024 respectively.
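A minimal sketch (PyTorch) of the top-down and bottom-up PAN steps described above follows; base_conv mirrors the basic convolution module (convolution, batch normalization, SiLU), the cross-stage local convolution block of the bottom-up step is passed in as csp_block (for example the CrossStageLocalConv class sketched earlier), and channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def base_conv(c_in, c_out, k=3, s=1):
    # conv -> batch norm -> SiLU, standing in for the basic convolution module
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class PANTopDownStep(nn.Module):
    """PAN modules 2/3: upsample the higher-level feature by 2x, concatenate it
    with the same-scale backbone feature, then apply a basic convolution module."""
    def __init__(self, c_high, c_skip, c_out):
        super().__init__()
        self.fuse = base_conv(c_high + c_skip, c_out, k=1)

    def forward(self, f_high, f_skip):
        up = F.interpolate(f_high, scale_factor=2.0, mode="nearest")
        return self.fuse(torch.cat([up, f_skip], dim=1))

class PANBottomUpStep(nn.Module):
    """PAN modules 4/5: stride-2 basic convolution to downscale the previous PAN
    feature, concatenate with the same-scale top-down feature, then the n = 3
    cross-stage local convolution module (csp_block)."""
    def __init__(self, c_low, csp_block):
        super().__init__()
        self.down = base_conv(c_low, c_low, k=3, s=2)
        self.csp = csp_block

    def forward(self, f_low, f_lateral):
        return self.csp(torch.cat([self.down(f_low), f_lateral], dim=1))
```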
In this embodiment, as shown in FIG. 1, the thermal infrared mode bottleneck takes F_ir^3, F_ir^4 and F_sp as input and outputs the thermal infrared multi-scale features P_t^3, P_t^4 and P_t^5; the visible light mode bottleneck takes F_rgb^3, F_rgb^4 and F_sp as input and outputs the visible light multi-scale features P_v^3, P_v^4 and P_v^5. The multi-scale features of the visible light and thermal infrared modes are processed similarly to the fusion mode; PAN modules 2, 3, 4 and 5 have the same structure as in the fusion mode but use parameters independent of the corresponding fusion mode structures.
In this embodiment, the basic convolution modules in the backbone network part and the bottleneck part consist of a 1×1 or 3×3 convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit (SiLU).
In this embodiment, the detection module in step S4 consists of a 1×1 convolution layer. Target detection is performed on the three levels of feature maps with strides 8, 16 and 32, and the detection module parameters of the three levels are independent. To reduce the number of learned parameters, the detection modules of the fusion, visible light and thermal infrared modes share parameters.
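A minimal sketch (PyTorch) of the detection modules described above: one 1×1 convolution per scale (strides 8, 16 and 32), with independent parameters across the three scales but shared across the fusion, visible light and thermal infrared modes. Channel counts and the number of anchors and classes are assumptions for illustration.

```python
import torch.nn as nn

num_anchors, num_classes = 3, 1                   # assumed: 3 anchors, 1 class (pedestrian)
out_ch = num_anchors * (5 + num_classes)          # (x, y, w, h, objectness) + class scores
in_channels = [256, 512, 1024]                    # stride-8/16/32 feature channels

# one 1x1 head per scale; the same three heads are applied to all three modes
det_heads = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1) for c in in_channels)

def detect(multi_scale_feats):
    """Apply the scale-specific 1x1 heads to one mode's multi-scale features."""
    return [head(f) for head, f in zip(det_heads, multi_scale_feats)]
```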
In this embodiment, the decision-level fusion of the three modes' detection results in step S4 refers to taking a weighted average of the same-scale detection results of the fusion, thermal infrared and visible light modes, P_f, P_t and P_v, to obtain the final detection result P, as shown in the following formula, where the weight coefficients λ_F, λ_I and λ_R of the three modes are 0.5, 0.25 and 0.25 respectively:

P = λ_F · P_f + λ_I · P_t + λ_R · P_v
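A minimal sketch of this decision-level fusion: at each scale the prediction maps of the three modes are combined by the weighted average above, after which standard non-maximum suppression is applied (NMS itself omitted here); function and variable names are assumptions for illustration.

```python
lam_f, lam_i, lam_r = 0.5, 0.25, 0.25   # fusion, thermal infrared, visible light weights

def decision_level_fusion(preds_fuse, preds_ir, preds_rgb):
    """preds_*: lists of per-scale prediction tensors with identical shapes."""
    return [lam_f * p_f + lam_i * p_i + lam_r * p_r
            for p_f, p_i, p_r in zip(preds_fuse, preds_ir, preds_rgb)]
```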
In this embodiment, the detection results of the three modes in step S4 are each trained in a supervised manner with a loss function and matched training labels. The labels shared by the visible light and thermal infrared modes are taken as ground truth for supervising the fusion mode detection results. To realize single-mode detection auxiliary supervision, the labels specific to the visible light mode supervise the visible light mode detection results, and the labels specific to the thermal infrared mode supervise the thermal infrared detection results. The total detection loss of this step is L_det, where L_cls, L_bbox and L_obj are the classification loss, bounding box loss and confidence loss of the fusion mode prediction; p_f is the fusion mode prediction result, obtained by merging P_f^3, P_f^4 and P_f^5, and gt_f is the modality-shared training label. L_obj_v and L_bbox_v are the confidence loss and bounding box loss of the visible light mode prediction task, and L_obj_t and L_bbox_t are the confidence loss and bounding box loss of the thermal infrared mode prediction task. p_v is the visible light single-mode prediction result, obtained by merging P_v^3, P_v^4 and P_v^5; p_t is the thermal infrared single-mode prediction result, obtained by merging P_t^3, P_t^4 and P_t^5. gt_v and gt_t are the visible light and thermal infrared labels respectively. The independent visible light and thermal infrared annotations provide mode-specific supervision information for the visible light and thermal infrared modes.
L_cls, L_obj, L_obj_v and L_obj_t are cross-entropy losses, and L_bbox, L_bbox_v and L_bbox_t are CIoU losses. λ_cls, λ_obj and λ_bbox are the weights of the classification loss, confidence loss and bounding box loss, set to 0.5, 0.5 and 0.05 respectively.
L_det = λ_cls·L_cls(p_f, gt_f) + λ_obj·L_obj(p_f, gt_f) + λ_bbox·L_bbox(p_f, gt_f) + λ_obj·L_obj_v(p_v, gt_v) + λ_bbox·L_bbox_v(p_v, gt_v) + λ_obj·L_obj_t(p_t, gt_t) + λ_bbox·L_bbox_t(p_t, gt_t)
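A minimal sketch of the detection loss L_det above, assuming helper functions cls_loss (cross entropy), obj_loss (cross entropy) and bbox_loss (CIoU) that each take one mode's predictions and matched targets; these helpers and the prediction/label containers are assumptions, not the patent's actual code.

```python
lam_cls, lam_obj, lam_bbox = 0.5, 0.5, 0.05

def detection_loss(p_f, gt_f, p_v, gt_v, p_t, gt_t,
                   cls_loss, obj_loss, bbox_loss):
    # fusion-mode prediction supervised by the shared (modality-common) labels
    l_fuse = (lam_cls * cls_loss(p_f, gt_f)
              + lam_obj * obj_loss(p_f, gt_f)
              + lam_bbox * bbox_loss(p_f, gt_f))
    # single-mode auxiliary supervision: visible light and thermal infrared
    # predictions supervised by their own modality-specific labels
    l_vis = lam_obj * obj_loss(p_v, gt_v) + lam_bbox * bbox_loss(p_v, gt_v)
    l_ir = lam_obj * obj_loss(p_t, gt_t) + lam_bbox * bbox_loss(p_t, gt_t)
    return l_fuse + l_vis + l_ir
```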
In this embodiment, training after step S4 is performed in a deep supervision manner, with the total loss function:
L_total = L_seg + L_det
where L_seg and L_det are the semantic segmentation loss of step S2 and the target detection loss of step S4, respectively.
To verify the validity of MSAS-YOLOv5, this embodiment uses the public KAIST dataset for training and testing of the network framework and compares it with other methods. The KAIST dataset contains 7595 visible-infrared image pairs for training and 2252 pairs for testing. The image size is 512 × 640 pixels. The dataset is divided into all-day, daytime and nighttime subsets, on which the performance of multi-source target detection algorithms is tested respectively.
The algorithm proposed in this embodiment is compared with five recent multi-source target detection methods: CIAN (Cross-modality Interactive Attention Network), MSDS-RCNN (Multispectral Simultaneous Detection and Segmentation R-CNN), AR-CNN (Aligned Region CNN), MBNet (Modality Balance Network) and UGCML (Uncertainty-Guided Cross-Modal Learning); the specific results are shown in Table 1. Two evaluation indexes are used. Accuracy is evaluated with the average logarithmic miss rate (MR^-2); the smaller this value, the fewer targets the detector misses and the higher its accuracy. Speed is measured in frames per second (FPS). As shown in Table 1, the average logarithmic miss rate of the method of this embodiment (MSAS-YOLOv5) on the all-day subset reaches the lowest value, 5.89%, showing that the method has the highest detection accuracy, and its speed reaches 24.39 FPS, exceeding the other multi-source target detection methods in the table. Compared with the second-best method (MBNet), MSAS-YOLOv5 reduces the average logarithmic miss rate by 2.27% on the all-day subset, 1.01% on the daytime subset and 4.24% on the nighttime subset. Fig. 2 compares three groups of multi-source target detection results of the method of this embodiment with mainstream existing methods. As seen in the first row of Fig. 2, the detection results of this method are more complete and closer to the ground truth. The input images in the second row contain many small pedestrian targets, which this method detects accurately while the other compared methods miss some of them. For the night-scene input images in the third row, the detection results of this method are still closer to the ground truth than those of the other methods.
Table 1 is a comparative table of test results for the methods of the examples of the present invention and other prior art methods
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.

Claims (8)

1. An RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5, characterized in that the detection model comprises four parts: a backbone network part, a bottleneck part, a segmentation part and a detection part; the backbone network part extracts convolution feature maps of the visible light and thermal infrared images respectively; the bottleneck part combines the visible light and thermal infrared feature maps top-down and bottom-up to obtain multi-level features for the fusion mode, the visible light mode and the thermal infrared mode; the segmentation part predicts semantic segmentation from the visible light and thermal infrared feature maps through convolution layers and serves as an auxiliary supervision task; the detection part performs target detection on the multi-level features of the fusion, visible light and thermal infrared modes and applies decision-level fusion to the prediction results of the three modes; the predictions of the visible light and thermal infrared modes are supervised by labels of their respective modes, providing single-mode auxiliary prediction information;
The method specifically comprises the following steps:
Step 1: the visible light image I_rgb and the thermal infrared image I_ir are input into the backbone network part for feature extraction; they sequentially pass through visible light convolution modules C_rgb^i and thermal infrared convolution modules C_ir^i, where i ∈ {1,2,3,4,5}, to obtain feature pairs F_rgb^i and F_ir^i, where F_rgb^i denotes the features extracted from the visible light image I_rgb by C_rgb^i and F_ir^i denotes the features extracted from the thermal infrared image I_ir by C_ir^i; after C_rgb^5 and C_ir^5, the features F_rgb^5 and F_ir^5 are channel-concatenated and fed into a spatial pyramid pooling module to obtain the processed multi-source feature F_sp;
Step 2: the feature maps F_rgb^3 and F_ir^3 output by the visible light convolution module C_rgb^3 and the thermal infrared convolution module C_ir^3 in step 1 are channel-concatenated and then passed through a segmentation convolution module to obtain a semantic segmentation prediction result;
Step 3: the backbone features F_rgb^3, F_rgb^4, F_ir^3 and F_ir^4 from step 1 and the feature F_sp from the spatial pyramid pooling module are fed into the bottleneck part; to realize single-mode independent prediction, the bottleneck part is provided with three groups of structures, a fusion mode bottleneck, a thermal infrared mode bottleneck and a visible light mode bottleneck, which extract three sets of multi-scale features for the fusion, thermal infrared and visible light modes respectively; the fusion mode bottleneck takes F_rgb^3, F_rgb^4, F_ir^3, F_ir^4 and F_sp as input, channel-concatenates the visible light and thermal infrared features of the same level for multi-source feature processing, and outputs multi-scale features P_f^3, P_f^4 and P_f^5; the thermal infrared mode bottleneck takes F_ir^3, F_ir^4 and F_sp as input and outputs thermal infrared multi-scale features P_t^3, P_t^4 and P_t^5; the visible light mode bottleneck takes F_rgb^3, F_rgb^4 and F_sp as input and outputs visible light multi-scale features P_v^3, P_v^4 and P_v^5;
Step 4: a detection module performs target detection on the three sets of multi-scale features output by the three mode bottlenecks in step 3, obtaining object predictions for each mode at each scale; the predictions of the fusion, thermal infrared and visible light modes at the same scale are combined by decision-level fusion; finally, non-maximum suppression post-processing is applied to the fused three-scale detection results to obtain the target detection prediction output for the input multi-source image.
2. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the visible light convolution modules C_rgb^i and the thermal infrared convolution modules C_ir^i in step 1 are each formed by a basic convolution module and a cross-stage local convolution module connected in series; the cross-stage local convolution module splits the input feature map into two feature maps F_p1 and F_p2 by channel, F_p1 passes through only one basic convolution module as a residual connection, F_p2 passes through several basic convolution modules, and finally the two feature maps are channel-concatenated and passed through a final basic convolution module to obtain the output of the cross-stage local convolution module.
3. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the parameters of the visible light convolution modules C_rgb^i and the thermal infrared convolution modules C_ir^i are independent.
4. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the segmentation convolution module in step 2 is a structure formed by a 3×3 convolution layer, batch normalization, a rectified linear unit and a 3×3 convolution layer connected in series.
5. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the three mode bottlenecks in step 3 each process the multi-scale feature maps using a path aggregation network PAN structure that contains top-down and bottom-up paths.
6. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the basic convolution modules in the backbone network part and the bottleneck part consist of a convolution layer, a batch normalization layer and a Sigmoid-weighted linear unit.
7. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the detection module in step 4 consists of a 1×1 convolution layer; target detection is performed on the three levels of feature maps with strides 8, 16 and 32, and the detection module parameters of the three levels are independent; to reduce the number of learned parameters, the detection modules of the fusion, visible light and thermal infrared modes share parameters.
8. The RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 as claimed in claim 1, wherein: the decision-level fusion of the three modes' detection results in step 4 refers to a weighted average of the fusion, visible light and thermal infrared mode detection results, with weight coefficients of 0.5, 0.25 and 0.25 respectively.
CN202310211247.7A 2023-03-07 2023-03-07 RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5 Active CN116665036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310211247.7A CN116665036B (en) 2023-03-07 2023-03-07 RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310211247.7A CN116665036B (en) 2023-03-07 2023-03-07 RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5

Publications (2)

Publication Number Publication Date
CN116665036A CN116665036A (en) 2023-08-29
CN116665036B true CN116665036B (en) 2024-09-17

Family

ID=87724909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310211247.7A Active CN116665036B (en) 2023-03-07 2023-03-07 RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5

Country Status (1)

Country Link
CN (1) CN116665036B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117975040B (en) * 2024-03-28 2024-06-18 南昌工程学院 GIS infrared image recognition system and method based on improvement YOLOv5

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767882A (en) * 2020-07-06 2020-10-13 江南大学 Multi-mode pedestrian detection method based on improved YOLO model
CN113361466A (en) * 2021-06-30 2021-09-07 江南大学 Multi-modal cross-directed learning-based multi-spectral target detection method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209810B (en) * 2018-12-26 2023-05-26 浙江大学 Boundary frame segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time through visible light and infrared images
CN112529878B (en) * 2020-12-15 2024-04-02 西安交通大学 Multi-view semi-supervised lymph node classification method, system and equipment
CN113627504B (en) * 2021-08-02 2022-06-14 南京邮电大学 Multi-mode multi-scale feature fusion target detection method based on generation of countermeasure network
CN115331162A (en) * 2022-07-14 2022-11-11 西安科技大学 Cross-scale infrared pedestrian detection method, system, medium, equipment and terminal
CN115713679A (en) * 2022-10-13 2023-02-24 北京大学 Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767882A (en) * 2020-07-06 2020-10-13 江南大学 Multi-mode pedestrian detection method based on improved YOLO model
CN113361466A (en) * 2021-06-30 2021-09-07 江南大学 Multi-modal cross-directed learning-based multi-spectral target detection method

Also Published As

Publication number Publication date
CN116665036A (en) 2023-08-29

Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
Aboah et al. Real-time multi-class helmet violation detection using few-shot data sampling technique and yolov8
CN112101221B (en) Method for real-time detection and identification of traffic signal lamp
CN112069940B (en) Cross-domain pedestrian re-identification method based on staged feature learning
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
Wu et al. UAV imagery based potential safety hazard evaluation for high-speed railroad using Real-time instance segmentation
CN111368754B (en) Airport runway foreign matter detection method based on global context information
CN110503161B (en) Ore mud ball target detection method and system based on weak supervision YOLO model
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN112434723B (en) Day/night image classification and object detection method based on attention network
Mei et al. Asymmetric global–local mutual integration network for RGBT tracking
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN115272652A (en) Dense object image detection method based on multiple regression and adaptive focus loss
CN116798070A (en) Cross-mode pedestrian re-recognition method based on spectrum sensing and attention mechanism
CN116665036B (en) RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5
Su et al. FSRDD: An efficient few-shot detector for rare city road damage detection
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN117333948A (en) End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism
CN112949510A (en) Human detection method based on fast R-CNN thermal infrared image
Tang et al. Foreign object detection for transmission lines based on Swin Transformer V2 and YOLOX
CN112395953A (en) Road surface foreign matter detection system
Song et al. MsfNet: a novel small object detection based on multi-scale feature fusion
Kumar et al. Improved YOLOv4 approach: a real time occluded vehicle detection
CN117495825A (en) Method for detecting foreign matters on tower pole of transformer substation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant