CN117690161B - Pedestrian detection method, device and medium based on image fusion - Google Patents
- Publication number: CN117690161B
- Application number: CN202311704548.XA
- Authority: CN (China)
- Prior art keywords: feature, image, visible light, thermal infrared, fusion
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of this status)
Classifications
- G06V40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06V10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/764 — Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 — Image or video recognition using pattern recognition or machine learning, using neural networks
Abstract
The invention relates to a pedestrian detection method, device and medium based on image fusion, comprising the following steps: S1, acquiring a real-time visible light image and a thermal infrared image and preprocessing them; S2, performing multi-scale feature extraction several times on the preprocessed visible light image and thermal infrared image respectively, generating several visible light feature maps and several thermal infrared feature maps; S3, performing weighted fusion of the visible light feature maps and the thermal infrared feature maps to obtain class activation maps; S4, inputting the class activation maps into a feature pyramid network for multi-scale feature fusion, generating fused feature maps; S5, executing detection tasks on the fused feature maps and outputting pedestrian detection results, wherein the detection tasks comprise pedestrian prediction bounding box regression and pedestrian prediction bounding box object classification. Compared with the prior art, the invention improves the accuracy and real-time performance of pedestrian detection results.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a pedestrian detection method, device and medium based on image fusion.
Background
In industry, transport vehicles are usually driven manually, but the complexity of the night-time workshop environment and possible driver error introduce uncertainty into driving safety, seriously threatening pedestrian safety and production efficiency. Object detection is one of the key tasks in computer vision, and pedestrian detection, an important branch of it, has developed remarkably; its results, however, depend largely on the quality of the input image. Under complex or low illumination, an optical imaging sensor can hardly provide enough information to outline the target clearly, and traditional single-modality detection techniques struggle to produce an ideal imaging result, which directly affects the accuracy and reliability of the pedestrian detection algorithm's output.
In this context, multi-modal object detection techniques have emerged; they aim to obtain more comprehensive target information by combining data from different sensors. Existing multi-modal pedestrian detection methods generally use several backbone networks to extract feature maps from each input modality and then fuse them algorithmically; the fusion stage lets the detection model draw detailed information from every input and thus achieve better performance. For example, LEE et al. proposed a cascade fusion method that concatenates the two modal feature maps to double the channel number and then uses an NiN layer to output the important features; however, doubling the channel number introduces redundant computation, increases complexity, degrades real-time performance and limits model deployment. KIM et al. proposed a weighted fusion method based on regions of interest, which selects the regions to fuse by judging the amount of features extracted from each region of interest; but it sacrifices the features of unfused regions, reducing detection accuracy for tiny targets, ignores the advantages of inter-modality fusion (leading to insufficient fusion), and discards intra-modality global information (losing fusion information). It is therefore necessary to design a pedestrian detection method that fully exploits the advantages of modality fusion and improves the accuracy and real-time performance of pedestrian detection.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a pedestrian detection method, device and medium based on image fusion, which improve the accuracy and real-time performance of pedestrian detection results.
The aim of the invention can be achieved by the following technical scheme:
a pedestrian detection method based on image fusion comprises the following steps:
S1, acquiring a real-time visible light image and a thermal infrared image and preprocessing them;
S2, performing multi-scale feature extraction several times on the preprocessed visible light image and thermal infrared image respectively, generating several visible light feature maps and several thermal infrared feature maps;
S3, performing weighted fusion of the visible light feature maps and the thermal infrared feature maps to obtain class activation maps;
S4, inputting the class activation maps into a feature pyramid network for multi-scale feature fusion, generating fused feature maps;
S5, executing detection tasks on the fused feature maps and outputting pedestrian detection results, wherein the detection tasks comprise pedestrian prediction bounding box regression and pedestrian prediction bounding box object classification.
Further, in step S1, the preprocessing includes:
unifying the pixel size and format of the image data;
performing filtering-based noise reduction and image enhancement.
Further, in step S2, the specific process of multi-scale feature extraction is as follows:
S201, acquiring the preprocessed visible light image and thermal infrared image, sampling every other (isolated) pixel in the horizontal and vertical directions, and generating several visible light image feature layers and several thermal infrared image feature layers;
S202, superposing all visible light image feature layers, superposing all thermal infrared image feature layers, inputting each into a convolution network for feature extraction, and generating the visible light feature maps and the thermal infrared feature maps.
Further, in step S202, the convolution network includes a convolution layer, a spatial pyramid pooling layer SPP, and a residual block layer.
Further, the specific process of step S3 is as follows:
S301, obtaining the visible light feature map F_v and the thermal infrared feature map F_t, performing an inner product operation to obtain a first feature map F_1, and then performing a spatial attention operation to obtain a second feature map F_2;
S302, obtaining the visible light feature map F_v and the thermal infrared feature map F_t, performing an addition operation to obtain a third feature map F_3, and then performing a convolution operation to obtain a fourth feature map F_4;
S303, performing a channel self-attention operation on the second feature map F_2 and the fourth feature map F_4 to generate the class activation map M.
Further, in step S301, the spatial attention operation proceeds as follows:
perform maximum pooling and average pooling on the first feature map F_1 respectively, then apply a convolution operation to each result;
splice the convolution results through an activation function to obtain the second feature map F_2.
Further, in step S303, the channel self-attention operation proceeds as follows:
perform an inner product operation on the second feature map F_2 and the fourth feature map F_4, then perform maximum pooling and average pooling respectively;
weight the maximum pooling and average pooling results respectively, then splice them through an activation function.
Further, the activation function is a Sigmoid activation function.
The invention also provides an electronic device comprising a memory, a processor and a program stored in the memory, wherein the processor implements the method according to any one of the above when executing the program.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described in any of the above.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention performs multi-scale feature extraction on the preprocessed visible light image and thermal infrared image respectively to generate several visible light feature maps and thermal infrared feature maps, then fuses them by weighting to obtain class activation maps. This effectively highlights the pedestrian feature information in both feature maps without losing information, helps capture more details and features in the image data, and improves the accuracy of the pedestrian detection result.
2. During multi-scale feature extraction, the invention first samples every other pixel in the horizontal and vertical directions, then superposes the resulting feature layers and passes them through a convolution network several times. The spatial pyramid pooling layer SPP used in the network handles targets of different sizes better without significantly enlarging the network, which speeds up model training and further improves the real-time performance of pedestrian detection.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a pedestrian detection model structure based on image fusion;
FIG. 3 is a schematic diagram of a CAM activation module.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
Examples:
This embodiment first builds a pedestrian detection model based on image fusion, as shown in FIG. 2, comprising a CSPDarknet module, a CAM activation module, a feature pyramid network and a detection head. The CSPDarknet module performs multi-scale feature extraction on the visible light image and the thermal infrared image respectively to obtain visible light feature maps and thermal infrared feature maps; the CAM activation module performs weighted fusion of the visible light and thermal infrared feature maps to obtain class activation maps; the feature pyramid network performs multi-scale feature fusion on the class activation maps to generate fused feature maps; and the detection head executes the detection tasks on the fused feature maps and outputs the final pedestrian detection result.
Based on this model, the embodiment provides a pedestrian detection method based on image fusion, as shown in FIG. 1, comprising the following steps:
S1, acquiring a real-time visible light image and a thermal infrared image and preprocessing.
The preprocessing process includes the following steps:
unifying the pixel size of each image to 320×240 pixels and the data format to json;
performing filtering-based noise reduction and image enhancement; common noise reduction methods include mean filtering, Gaussian filtering and median filtering, and common enhancement methods include Laplacian-based enhancement and logarithmic (Log) transformation-based enhancement.
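As an illustrative sketch of this preprocessing step, assuming OpenCV, median filtering for noise reduction and Laplacian sharpening for enhancement (the embodiment does not fix these particular choices):

```python
import cv2
import numpy as np

def preprocess(img: np.ndarray, size=(320, 240)) -> np.ndarray:
    """Unify size, then apply filtering-based noise reduction and enhancement."""
    img = cv2.resize(img, size)            # unify pixel size to 320x240 (width, height)
    img = cv2.medianBlur(img, 3)           # median filtering for noise reduction
    lap = cv2.Laplacian(img, cv2.CV_64F)   # Laplacian-based image enhancement (sharpening)
    return np.clip(img - 0.5 * lap, 0, 255).astype(np.uint8)
```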
S2, inputting the preprocessed visible light image and thermal infrared image into the CSPDarknet module respectively for multi-scale feature extraction, generating several visible light feature maps and several thermal infrared feature maps. The specific process is as follows:
S201, acquiring the preprocessed visible light image and thermal infrared image and inputting them into a Focus layer, which samples every other pixel in the horizontal and vertical directions, generating visible light image feature layers and thermal infrared image feature layers. Each input image is thereby recombined into four feature layers, which are then superposed, expanding the number of input channels four-fold: the superposed feature layers have 12 channels compared with the original 3-channel input. A minimal sketch of this slicing operation is given below.
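A minimal PyTorch sketch of the Focus-style slicing just described (the function name is illustrative):

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Sample every other pixel in each direction and stack the four resulting
    feature layers along the channel axis: (B, 3, H, W) -> (B, 12, H/2, W/2)."""
    return torch.cat(
        [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
        dim=1,
    )
```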
S202, inputting the result of step S201 into a convolution network three times for feature extraction, where the convolution network comprises a convolution layer, a spatial pyramid pooling layer SPP and several residual block layers. In the convolution layer, the kernel size is 3×3, the stride is 2 and the padding is 1; the spatial pyramid pooling layer SPP has three branches, with kernel sizes of 5×5, 7×7 and 9×9 respectively; each residual block layer applies a 1×1 convolution and a 3×3 convolution to the image. After three rounds of feature extraction, as shown in FIG. 2, a first, second and third visible light feature map and a first, second and third thermal infrared feature map are generated. The SPP layer handles targets of different sizes better without significantly enlarging the network, which speeds up model training and further improves the real-time performance of pedestrian detection.
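A hedged sketch of the SPP block, reading the listed 5×5, 7×7 and 9×9 kernels as pooling kernel sizes in the conventional SPP arrangement (stride 1, padding k//2 so spatial size is preserved); the trailing 1×1 reduction conv is an assumption:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pool branches concatenated with the input."""
    def __init__(self, channels: int):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (5, 7, 9)]
        )
        self.reduce = nn.Conv2d(channels * 4, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reduce(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```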
S3, inputting the visible light feature maps and the thermal infrared feature maps into the CAM activation module for weighted fusion, obtaining the corresponding class activation maps.
The CAM activation module enhances the representation of the feature maps and integrates the feature information of the visible light and thermal branches, helping to capture more details and features while ensuring that intra-modality features are not lost for the sake of inter-modality complementarity. The structure of the CAM activation module is shown in FIG. 3. First, an inner product of the visible light feature map F_v and the thermal infrared feature map F_t yields the first feature map F_1, and a spatial attention operation on F_1 yields the second feature map F_2. In parallel, adding F_v and F_t yields the third feature map F_3, and a convolution on F_3 yields the fourth feature map F_4. Finally, a channel self-attention operation on F_2 and F_4 generates the class activation map M. The whole process is expressed as:

F_1 = F_v ⊙ F_t,  F_2 = SA(F_1),  F_3 = F_v + F_t,  F_4 = conv(F_3),  M = CSA(F_2, F_4),

where CSA is the channel self-attention operation, SA is the spatial attention operation, conv is a convolution operation, and ⊙ denotes the element-wise (inner) product.
The spatial attention operation SA is:

F_2 = SA(F_1) = σ( [ F^{7×7}(F_avg(F_1)) ; F^{7×7}(F_max(F_1)) ] ) ⊙ F_1,

where F_avg is average pooling, F_max is maximum pooling, F^{7×7} is a convolution with a 7×7 kernel, σ is the activation function used to splice the convolution results, and [ ; ] denotes concatenation.
The channel self-attention operation CSA, with F = F_2 ⊙ F_4, is:

M = CSA(F_2, F_4) = σ( W_1(W_0(F_avg(F))) + W_1(W_0(F_max(F))) ) ⊙ F,

where F_avg is average pooling, F_max is maximum pooling, W_1 and W_0 are trainable weight matrices, and σ is the Sigmoid activation function.
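Putting S301–S303 together, a PyTorch sketch of the CAM activation module follows. The channel-reduction ratio, the CBAM-style single 7×7 conv over the concatenated channel-pooled maps, and re-applying the attention weights to the fused map are assumptions where the description leaves details open:

```python
import torch
import torch.nn as nn

class CAMFusion(nn.Module):
    """Weighted fusion of visible (F_v) and thermal (F_t) feature maps
    into a class activation map M, following steps S301-S303."""
    def __init__(self, channels: int):
        super().__init__()
        self.sa_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)          # spatial attention
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.w0 = nn.Conv2d(channels, channels // 4, kernel_size=1)       # trainable W0
        self.w1 = nn.Conv2d(channels // 4, channels, kernel_size=1)       # trainable W1

    def forward(self, fv: torch.Tensor, ft: torch.Tensor) -> torch.Tensor:
        f1 = fv * ft                                    # S301: inner product -> F1
        pooled = torch.cat(
            [f1.mean(dim=1, keepdim=True), f1.amax(dim=1, keepdim=True)], dim=1
        )
        f2 = torch.sigmoid(self.sa_conv(pooled)) * f1   # spatial attention -> F2
        f4 = self.conv(fv + ft)                         # S302: add, then conv -> F4
        f = f2 * f4                                     # S303: inner product of F2, F4
        avg = self.w1(torch.relu(self.w0(f.mean(dim=(2, 3), keepdim=True))))
        mx = self.w1(torch.relu(self.w0(f.amax(dim=(2, 3), keepdim=True))))
        return torch.sigmoid(avg + mx) * f              # channel self-attention -> M
```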
Through step S3, the first, second and third class activation maps are obtained. A class activation map is a visual heat map generated by a special convolutional neural network structure; it effectively highlights the pedestrian feature information in the visible light and thermal infrared feature maps without losing information, helps capture more details and features in the image data, and improves the accuracy of the pedestrian detection result.
S4, inputting each class activation map into the feature pyramid network for multi-scale feature fusion, generating the fused feature maps.
The first, second and third class activation maps are input into the feature pyramid network (the YOLOX PAFPN). The network comprises four up-sampling and four down-sampling processes, and three feature maps are output after the second, third and fourth down-sampling steps respectively; the sampling convolutions use a 3×3 kernel with a stride of 2, carrying out the multi-scale feature fusion.
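For illustration only, a simplified top-down FPN-style fusion is sketched below; the embodiment itself uses the YOLOX PAFPN, which adds a bottom-up path that this minimal version omits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Fuse three class activation maps of decreasing resolution top-down."""
    def __init__(self, in_channels: list, out_channels: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels]
        )

    def forward(self, feats: list) -> list:
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):   # top-down pathway: upsample and add
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [s(l) for s, l in zip(self.smooth, laterals)]
```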
S5, executing the detection tasks on the fused feature maps and outputting the pedestrian detection result, wherein the detection tasks comprise pedestrian prediction bounding box regression and pedestrian prediction bounding box object classification.
This embodiment selects the YOLOX Head as the detection head. It uses a 1×1 convolution to reduce feature maps with differing channel numbers to a uniform channel number, which helps unify the feature map dimensions. Two parallel branches, each containing two 3×3 convolution kernels, then perform the different detection tasks, namely pedestrian prediction bounding box regression and pedestrian prediction bounding box object classification respectively.
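A sketch of such a decoupled head for a single pyramid level (the intermediate width, activation, and output parameterization are assumptions):

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """1x1 conv unifies channels, then two parallel branches of two 3x3 convs
    perform bounding box regression and object classification."""
    def __init__(self, in_channels: int, num_classes: int = 1, width: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, kernel_size=1)
        def branch():
            return nn.Sequential(
                nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
                nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            )
        self.reg_branch, self.cls_branch = branch(), branch()
        self.reg_out = nn.Conv2d(width, 4, kernel_size=1)            # box (x, y, w, h)
        self.cls_out = nn.Conv2d(width, num_classes, kernel_size=1)  # pedestrian score

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        return self.reg_out(self.reg_branch(x)), self.cls_out(self.cls_branch(x))
```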
In this embodiment, the OSU-CT visible light/thermal infrared dataset, labeled in Cvml format, is used to train the pedestrian detection model based on image fusion. Data with missing labels are first screened out, leaving 4125 image pairs in total; the pixel sizes of the visible light and thermal infrared images are then uniformly adjusted to 320×240 pixels and the format to json before being input to the model. After the model executes the detection tasks, the loss of each detection task is calculated separately, and the parameters of the pedestrian detection model are updated by the back-propagation algorithm.
In a preferred embodiment, the loss of the pedestrian prediction bounding box regression task is calculated using the IOU loss function:

L_IOU = 1 − |box_gt ∩ box_pre| / |box_gt ∪ box_pre|,

where box_gt and box_pre are the ground-truth box region and the predicted box region of the target detection, respectively.
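A minimal sketch of this IOU loss for axis-aligned boxes given as (x1, y1, x2, y2):

```python
import torch

def iou_loss(box_pre: torch.Tensor, box_gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """IOU loss = 1 - |intersection| / |union| for (N, 4) boxes as (x1, y1, x2, y2)."""
    lt = torch.max(box_pre[:, :2], box_gt[:, :2])   # top-left of intersection
    rb = torch.min(box_pre[:, 2:], box_gt[:, 2:])   # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (box_pre[:, 2] - box_pre[:, 0]) * (box_pre[:, 3] - box_pre[:, 1])
    area_g = (box_gt[:, 2] - box_gt[:, 0]) * (box_gt[:, 3] - box_gt[:, 1])
    union = area_p + area_g - inter + eps
    return (1.0 - inter / union).mean()
```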
In a preferred embodiment, the loss of the pedestrian prediction bounding box object classification task is calculated with a cross-entropy loss function with a loss weight of 1.0; the regression loss uses the IOU loss function with a loss weight of 5.0; the learning rate is 0.00001; and the loss weight of the L1 loss function is 1.0.
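Under these settings, the total training loss could be assembled as in this hedged sketch (the per-term loss values are assumed to be computed elsewhere; the combination rule itself is not spelled out in the text):

```python
import torch

def total_loss(cls_loss: torch.Tensor, iou_loss_val: torch.Tensor,
               l1_loss_val: torch.Tensor) -> torch.Tensor:
    """Weighted sum per the preferred embodiment: cls 1.0, IOU 5.0, L1 1.0."""
    return 1.0 * cls_loss + 5.0 * iou_loss_val + 1.0 * l1_loss_val
```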
To verify the performance of the invention, this example conducted experiments on a public dataset and compared against some current mainstream pedestrian detection methods. The experiments were trained and tested according to the experimental specifications of the corresponding datasets; the results are shown in Table 1.
In Table 1, Method 1 applies the Fast-RCNN method to the visible light single-modality dataset, Method 2 applies the Fast-RCNN method to the thermal infrared single-modality dataset, and Method 3 applies the YOLOX method to the multi-modality dataset. Comparing Methods 1 and 2 with the method of the invention shows that, relative to single-modality methods, the proposed method supplements additional information from the other modality and retains detection capability across different challenging scenes. Comparing Method 3 with the method of the invention shows that the CAM (Class Activation Map) activation module effectively highlights pedestrian feature information without losing information within each modality, achieving a superior detection effect.
Table 1: Experimental results on the OSU-CT dataset

| Method | AP50 | AP75 | mAP |
|---|---|---|---|
| Method 1 | 75.4 | 36.2 | 37.8 |
| Method 2 | 66.3 | 21.2 | 30.6 |
| Method 3 | 84.2 | 36.3 | 41.8 |
| The method of the invention | 98.6 | 59.3 | 57.6 |
The foregoing description covers only the preferred embodiments of the invention and is not intended to limit it. The invention also encompasses technical schemes formed by any combination of the above technical features.
The previous description of the embodiments is provided to enable a person of ordinary skill in the art to make and use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles described herein may be applied to other embodiments without inventive effort. The present invention is therefore not limited to the embodiments described above; improvements and modifications made by those skilled in the art according to this disclosure without departing from the scope of the present invention fall within its protection.
Claims (9)
1. A pedestrian detection method based on image fusion, characterized by comprising the following steps:
S1, acquiring a real-time visible light image and a thermal infrared image and preprocessing them;
S2, performing multi-scale feature extraction several times on the preprocessed visible light image and thermal infrared image respectively, generating several visible light feature maps and several thermal infrared feature maps;
S3, performing weighted fusion of the visible light feature maps and the thermal infrared feature maps to obtain class activation maps;
S4, inputting the class activation maps into a feature pyramid network for multi-scale feature fusion, generating fused feature maps;
S5, executing detection tasks on the fused feature maps and outputting pedestrian detection results, wherein the detection tasks comprise pedestrian prediction bounding box regression and pedestrian prediction bounding box object classification;
The specific process of step S3 is as follows:
S301, obtaining the visible light feature map and the thermal infrared feature map and performing an inner product operation to obtain a first feature map F_1, then performing a spatial attention operation on the first feature map F_1 to obtain a second feature map F_2;
S302, obtaining the visible light feature map and the thermal infrared feature map and performing an addition operation to obtain a third feature map F_3, then performing a convolution operation on the third feature map F_3 to obtain a fourth feature map F_4;
S303, performing a channel self-attention operation on the second feature map F_2 and the fourth feature map F_4 to generate the class activation map.
2. The pedestrian detection method based on image fusion according to claim 1, wherein in step S1 the preprocessing includes:
unifying the pixel size and format of the image data;
performing filtering-based noise reduction and image enhancement.
3. The pedestrian detection method based on image fusion according to claim 1, wherein in step S2 the specific process of multi-scale feature extraction is as follows:
S201, acquiring the preprocessed visible light image and thermal infrared image, sampling every other (isolated) pixel in the horizontal and vertical directions, and generating several visible light image feature layers and several thermal infrared image feature layers;
S202, superposing all visible light image feature layers, superposing all thermal infrared image feature layers, inputting each into a convolution network for feature extraction, and generating the visible light feature maps and the thermal infrared feature maps.
4. A pedestrian detection method based on image fusion according to claim 3, wherein in step S202, the convolution network comprises a convolution layer, a spatial pyramid pooling layer SPP, and a residual block layer.
5. The pedestrian detection method based on image fusion according to claim 1, wherein in step S301 the spatial attention operation proceeds as follows:
perform maximum pooling and average pooling on the first feature map F_1 respectively, then apply a convolution operation to each result;
splice the convolution results through an activation function to obtain the second feature map F_2.
6. The pedestrian detection method based on image fusion according to claim 1, wherein in step S303 the channel self-attention operation proceeds as follows:
perform an inner product operation on the second feature map F_2 and the fourth feature map F_4, then perform maximum pooling and average pooling respectively;
weight the maximum pooling and average pooling results respectively, then splice them through an activation function.
7. The pedestrian detection method based on image fusion of claim 6, wherein the activation function is a Sigmoid activation function.
8. An electronic device comprising a memory, a processor, and a program stored in the memory, wherein the processor implements the method of any of claims 1-7 when executing the program.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311704548.XA | 2023-12-12 | 2023-12-12 | Pedestrian detection method, device and medium based on image fusion |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN117690161A | 2024-03-12 |
| CN117690161B | 2024-06-04 |
Family

ID: 90125990

Family Applications (1)

| Application Number | Status | Priority Date | Filing Date |
|---|---|---|---|
| CN202311704548.XA (CN117690161B) | Active | 2023-12-12 | 2023-12-12 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN117690161B (en) |
Families Citing this family (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118314490B * | 2024-06-11 | 2024-09-17 | 合肥工业大学 (Hefei University of Technology) | Air-space-ground multi-scale re-decision method and system for ultra-high voltage transformer substation |
Citations (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR101908481B1 * | 2017-07-24 | 2018-12-10 | 동국대학교 산학협력단 (Dongguk University Industry-Academic Cooperation Foundation) | Device and method for pedestrian detection |
| CN113139542A * | 2021-04-28 | 2021-07-20 | 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.) | Target detection method, device, equipment and computer readable storage medium |
| CN116452937A * | 2023-04-25 | 2023-07-18 | 重庆邮电大学 (Chongqing University of Posts and Telecommunications) | Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism |
| CN116580425A * | 2023-05-12 | 2023-08-11 | 浙江工业大学 (Zhejiang University of Technology) | Multispectral pedestrian detection method based on cross-Transformer fusion |
| CN116645696A * | 2023-05-31 | 2023-08-25 | 长春理工大学重庆研究院 (Chongqing Research Institute of Changchun University of Science and Technology) | Contour information guided feature detection method for multi-mode pedestrian detection |
| CN117132759A * | 2023-08-02 | 2023-11-28 | 上海无线电设备研究所 (Shanghai Radio Equipment Research Institute) | Saliency target detection method based on multiband visual image perception and fusion |
Also Published As

| Publication number | Publication date |
|---|---|
| CN117690161A | 2024-03-12 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |