CN111079516B - Pedestrian gait segmentation method based on deep neural network - Google Patents
Pedestrian gait segmentation method based on deep neural network
- Publication number
- CN111079516B · CN201911050215.3A · CN201911050215A
- Authority
- CN
- China
- Prior art keywords
- mask
- pedestrian
- gait
- size
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a pedestrian gait segmentation method based on a deep neural network, addressing the problems that the O-shaped gap between the two legs is difficult to segment and that the legs themselves are not segmented finely enough. The method achieves fine segmentation of pedestrian gait through two steps: designing a dilated-convolution residual network and adding an edge-detector branch. Replacing the ordinary convolutions of the last stage of the ResNet with dilated convolutions enlarges the receptive field of the shallow network, so that features carrying more information are obtained and passed to the next stage; the resulting mask is then fed into an edge detector composed of edge-detection operators. This effectively addresses the poor fit of the predicted gait edge, yields a more accurate pedestrian gait edge, and improves the fineness of the leg segmentation.
Description
Technical Field
The invention relates to the technical field of image processing and pattern recognition in computer vision, in particular to a pedestrian gait segmentation method based on a deep neural network.
Background
In recent years, video surveillance has been widely applied in fields such as traffic, the military, urban construction, and public safety, and its importance can no longer be ignored.
Pedestrian gait segmentation is an indispensable part of video surveillance technology. Extracting the pedestrian region from images or video of a pedestrian's gait is a key link in pedestrian gait recognition and one of the most demanding computer vision tasks.
At present, research on pedestrian gait segmentation is scarce, whereas research on instance segmentation is relatively mature. Instance segmentation is a basic computer vision technique and a key step from image processing to image analysis; it is the first step of image analysis and one of the most demanding computer vision tasks, involving object localization and segmentation of object instances. In recent years, a large number of instance segmentation papers have been published and many instance segmentation methods proposed, providing a solid technical basis for pedestrian gait segmentation.
Disclosure of Invention
The invention aims to provide a pedestrian gait segmentation method based on a deep neural network.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a pedestrian gait segmentation method based on a deep neural network is characterized by comprising the following steps:
s1) predicting gait boundaries of pedestrians
Given a picture or a video, the gait boundaries of one or more pedestrians in it are predicted;
For a picture, all pedestrian targets in the single picture are detected and gait segmentation is performed on each target;
For a video, each frame is taken as input, all pedestrian targets in each frame are detected and segmented, and the processed frames are output and recombined into a segmented pedestrian gait video;
s2) image preprocessing and label making
Uniformly resize the segmented pedestrian gait images to h × w, where h is the image height and w is the image width;
Make the labels by processing the pixel values of the targets at the corresponding positions of the image: pixels with a value of 14 delineate the pedestrian positions, and pixels at non-pedestrian positions are uniformly set to 0 to represent the background;
s3) constructing a gait segmentation deep convolution neural network
S3-1) extracting characteristics by adopting a basic network
A ResNet-50 network is used as the base network; in the ResNet-50 structure, the ordinary convolutions of the last stage are replaced by dilated (atrous) convolutions with a dilation rate of 2;
S3-2) The image preprocessed in step S2) is input into the base network of step S3-1); the resulting features are then fed into an FPN to further extract features at each scale. The FPN uses the bottom-up feature maps of different resolutions of the same image at each layer to efficiently generate a multi-scale feature representation of the picture;
S3-3) The features extracted in step S3-2) are passed through ROIAlign to generate ROI features of size 14 × 14 × 256; ROIAlign maps the candidate-box region proposals to fixed-size feature maps and uses bilinear interpolation to obtain more accurate pedestrian candidate boxes;
S3-4) The 14 × 14 × 256 feature map from step S3-3) is transformed, through 5 convolutions followed by a deconvolution, into the pedestrian mask P_mask of size 28 × 28 × 1;
S3-5) A max pooling layer with kernel size 2 and stride 2 is applied to the 28 × 28 × 1 P_mask obtained in step S3-4), so that the predicted mask has the same spatial size as the output of step S3-3); the pooled mask is combined with that output to obtain a feature map of size 14 × 14 × 257;
This feature map is passed through 4 convolutional layers whose kernel size and number of filters are set to 3 and 256 respectively; 3 fully connected layers are then added, the first two with 1024 units and the last with as many units as there are categories, which here is 1 (pedestrian). The output value is the score of the mask; the threshold is set to 0.5, and masks with a score greater than 0.5 are adopted and defined as GT_mask;
S4) A loss function is constructed using the binary cross-entropy loss Binary_Cross_Entropy. The true probability is expressed as $p \in \{y, 1-y\}$ and the predicted probability as $q \in \{\hat{y}, 1-\hat{y}\}$, where $y$ denotes the probability that the sample belongs to a pedestrian, $1-y$ the probability that it belongs to the background, $\hat{y}$ the predicted pedestrian probability, and $1-\hat{y}$ the predicted background probability. The similarity between $p$ and $q$ is measured by the cross entropy:

$$H(p, q) = -\sum_i p_i \log q_i = -\big(y \log \hat{y} + (1-y)\log(1-\hat{y})\big)$$
S5) The binary cross-entropy loss Binary_Cross_Entropy is used to compare the information of each pixel in GT_mask and P_mask;
S6) The P_mask and GT_mask obtained in step S3) are input into an edge detector consisting of a single 3 × 3 edge-detection operator; each mask is convolved with the operator to obtain its edge. The edge result obtained from P_mask is defined as $E_P$, and the edge result obtained from GT_mask is defined as $E_{GT}$;
S7) A loss function loss is constructed from the edge results $E_P$ and $E_{GT}$ obtained in step S6), measuring the discrepancy between the predicted edge and the reference edge.
compared with the prior art, the invention has the following advantages:
the invention provides a pedestrian gait segmentation method based on a deep neural network, aiming at the situation that O-shaped legs exist in pedestrian gait and the leg shape is difficult to draw. The method realizes the fine segmentation of the gait of the pedestrian by two steps of designing a cavity convolution residual convolution network and adding an edge detector branch; the cavity convolution is used for replacing the common convolution of the last stage of the resnet to improve the receptive field of the shallow network, the characteristics of more information are obtained and transmitted to the next stage, and the finally obtained mask is input into an edge detector consisting of edge detection operators, so that the problem that the gait edge in the gait of the pedestrian is not fitted is well solved, and the gait edge of the pedestrian is more accurate.
Detailed Description
A pedestrian gait segmentation method based on a deep neural network is characterized by comprising the following steps:
s1) predicting gait boundaries of pedestrians
Given a picture or a video, the gait boundaries of one or more pedestrians in it are predicted;
For a picture, all pedestrian targets in the single picture are detected and gait segmentation is performed on each target;
For a video, each frame is taken as input, all pedestrian targets in each frame are detected and segmented, and the processed frames are output and recombined into a segmented pedestrian gait video.
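A minimal frame-by-frame sketch of the video case in step S1), using OpenCV; segment_pedestrians is a hypothetical stand-in for the detection-and-segmentation network described below, and the file paths and codec are illustrative assumptions:

```python
import cv2

def segment_video(in_path: str, out_path: str, segment_pedestrians) -> None:
    """Read a video, segment every frame, and recombine the frames into an output video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    ok, frame = cap.read()
    while ok:
        out.write(segment_pedestrians(frame))  # detect and segment all pedestrians in the frame
        ok, frame = cap.read()
    cap.release()
    out.release()
```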
S2) image preprocessing and label making
Uniformly resize the segmented pedestrian gait images to h × w, where h is the image height and w is the image width;
Make the labels by processing the pixel values of the targets at the corresponding positions of the image: pixels with a value of 14 delineate the pedestrian positions, and pixels at non-pedestrian positions are uniformly set to 0 to represent the background.
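A minimal label-making sketch under the assumption that the pedestrian region is available as a boolean array for an image already resized to h × w; the values 14 (pedestrian) and 0 (background) follow the text above, while the array names and sizes are illustrative:

```python
import numpy as np
from PIL import Image

H, W = 512, 256  # example h x w after the uniform resize

def make_label(pedestrian_mask: np.ndarray) -> Image.Image:
    """pedestrian_mask: boolean (H, W) array, True where a pedestrian is present."""
    label = np.zeros((H, W), dtype=np.uint8)  # non-pedestrian positions stay 0 (background)
    label[pedestrian_mask] = 14               # pedestrian positions are delineated with value 14
    return Image.fromarray(label, mode="L")
```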
S3) constructing a gait segmentation deep convolution neural network
S3-1) extracting characteristics by adopting a basic network
A ResNet-50 network is used as the base network; on top of the ResNet-50 structure, the ordinary convolutions of its last stage are replaced by dilated (atrous) convolutions with a dilation rate of 2. This enlarges the receptive field of the network and benefits the subsequent feature extraction of the deep network. The ResNet-50 network is a 50-layer residual convolutional network, i.e., a deep residual network.
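A minimal sketch of such a base network, assuming a torchvision ResNet-50 stands in for the ResNet-50 described above; replace_stride_with_dilation turns the 3 × 3 convolutions of the last stage (layer4) into dilated convolutions with a dilation rate of 2:

```python
import torchvision

# Only the last stage is dilated; the earlier stages keep ordinary convolutions.
backbone = torchvision.models.resnet50(
    replace_stride_with_dilation=[False, False, True],
)
# layer4 now uses dilation=2 in place of its ordinary (dilation=1) 3x3 convolutions,
# enlarging the receptive field without further reducing the spatial resolution.
```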
S3-2) The image preprocessed in step S2) is input into the base network of step S3-1); after the base network, the features are fed into an FPN (Feature Pyramid Network) to further extract features at each scale. The FPN is an efficient CNN feature-extraction method: it uses the bottom-up feature maps of different resolutions of the same image at each layer to efficiently generate a multi-scale feature representation, producing feature maps with stronger expressive power for the computer vision task of the next stage;
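A minimal FPN sketch, assuming torchvision's FeaturePyramidNetwork stands in for the FPN described here; the channel counts correspond to ResNet-50 stages C2–C5 and the spatial sizes are illustrative:

```python
from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork

fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

# Bottom-up feature maps of the same image at decreasing resolution (dummy tensors).
feats = OrderedDict(
    c2=torch.rand(1, 256, 128, 64),
    c3=torch.rand(1, 512, 64, 32),
    c4=torch.rand(1, 1024, 32, 16),
    c5=torch.rand(1, 2048, 16, 8),
)
pyramid = fpn(feats)  # multi-scale feature maps, each with 256 channels
```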
S3-3) The features extracted in step S3-2) are passed through ROIAlign to generate ROI features of size 14 × 14 × 256; ROIAlign maps the candidate-box region proposals to fixed-size feature maps and uses bilinear interpolation to obtain more accurate pedestrian candidate boxes. ROIAlign is the region-feature aggregation method proposed in Kaiming He et al., Mask R-CNN, ICCV 2017;
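A minimal ROIAlign sketch using torchvision: a candidate box is mapped onto the feature map and bilinearly interpolated into a fixed 14 × 14 × 256 ROI feature. The box coordinates, feature-map size, and spatial_scale below are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.rand(1, 256, 50, 50)             # one FPN level with 256 channels
boxes = [torch.tensor([[10.0, 10.0, 40.0, 45.0]])]   # one pedestrian proposal in image coordinates
roi_feats = roi_align(
    feature_map, boxes, output_size=(14, 14),
    spatial_scale=0.25,    # stride of this feature map relative to the input image
    sampling_ratio=2,      # bilinear sampling points per output bin
)
print(roi_feats.shape)     # torch.Size([1, 256, 14, 14])
```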
S3-4) The 14 × 14 × 256 feature map from step S3-3) is transformed, through 5 convolutions followed by a deconvolution, into the pedestrian mask P_mask of size 28 × 28 × 1;
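A minimal sketch of this mask branch: five 3 × 3 convolutions followed by a 2 × 2 deconvolution (stride 2) turn the 14 × 14 × 256 ROI feature into a 28 × 28 × 1 pedestrian mask P_mask. Channel widths and the final 1 × 1 prediction layer are standard Mask R-CNN choices assumed here rather than taken verbatim from the patent:

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, in_channels: int = 256):
        super().__init__()
        layers = []
        for _ in range(5):  # 5 convolutions
            layers += [nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # 14x14 -> 28x28
        self.predict = nn.Conv2d(256, 1, kernel_size=1)                      # 1 class: pedestrian

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        x = self.deconv(self.convs(roi_feats)).relu()
        return torch.sigmoid(self.predict(x))  # P_mask, shape (N, 1, 28, 28)

p_mask = MaskHead()(torch.rand(1, 256, 14, 14))
```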
S3-5) A max pooling layer with kernel size 2 and stride 2 is applied to the 28 × 28 × 1 P_mask obtained in step S3-4), so that the predicted mask has the same spatial size as the output of step S3-3); the pooled mask is combined with that output to obtain a feature map of size 14 × 14 × 257;
This feature map is passed through 4 convolutional layers whose kernel size and number of filters are set to 3 and 256 respectively; 3 fully connected layers are then added, the first two with 1024 units and the last with as many units as there are categories, which here is 1 (pedestrian). The output value is the score of the mask; the threshold is set to 0.5, and masks with a score greater than 0.5 are adopted and defined as GT_mask.
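A minimal sketch of this scoring branch: the 28 × 28 × 1 P_mask is max-pooled to 14 × 14, concatenated with the 14 × 14 × 256 ROI feature (257 channels), passed through four 3 × 3 convolutions with 256 filters and three fully connected layers (1024, 1024, 1), and the output is thresholded at 0.5. Details not stated above (activations, flattening) are assumptions:

```python
import torch
import torch.nn as nn

class MaskScoreHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 28x28 -> 14x14
        layers, in_ch = [], 257
        for _ in range(4):  # 4 convolutional layers
            layers += [nn.Conv2d(in_ch, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_ch = 256
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Sequential(  # 3 fully connected layers
            nn.Flatten(),
            nn.Linear(256 * 14 * 14, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1),   # one category: pedestrian
        )

    def forward(self, roi_feats: torch.Tensor, p_mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([roi_feats, self.pool(p_mask)], dim=1)  # (N, 257, 14, 14)
        return torch.sigmoid(self.fc(self.convs(x)))          # mask score in (0, 1)

score = MaskScoreHead()(torch.rand(1, 256, 14, 14), torch.rand(1, 1, 28, 28))
keep = score > 0.5  # masks above the threshold define GT_mask
```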
S4) A loss function is constructed using the binary cross-entropy loss Binary_Cross_Entropy. The true probability is expressed as $p \in \{y, 1-y\}$ and the predicted probability as $q \in \{\hat{y}, 1-\hat{y}\}$, where $y$ denotes the probability that the sample belongs to a pedestrian, $1-y$ the probability that it belongs to the background, $\hat{y}$ the predicted pedestrian probability, and $1-\hat{y}$ the predicted background probability. The similarity between $p$ and $q$ is measured by the cross entropy:

$$H(p, q) = -\sum_i p_i \log q_i = -\big(y \log \hat{y} + (1-y)\log(1-\hat{y})\big)$$
S5) The binary cross-entropy loss Binary_Cross_Entropy is used to compare the information of each pixel in GT_mask and P_mask.
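A minimal sketch of the pixel-wise comparison in steps S4)–S5); P_mask and GT_mask are assumed to be probability maps of the same spatial size, and the tensors here are dummies:

```python
import torch
import torch.nn.functional as F

p_mask = torch.rand(1, 1, 28, 28)                    # predicted pedestrian probabilities
gt_mask = (torch.rand(1, 1, 28, 28) > 0.5).float()   # reference mask

# Mean over all pixels of -(y*log(y_hat) + (1-y)*log(1-y_hat))
bce = F.binary_cross_entropy(p_mask, gt_mask)
```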
S6) The P_mask and GT_mask obtained in step S3) are input into an edge detector consisting of a single 3 × 3 edge-detection operator; each mask is convolved with the operator to obtain its edge. The edge result obtained from P_mask is defined as $E_P$, and the edge result obtained from GT_mask is defined as $E_{GT}$.
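A minimal edge-detector sketch: a single fixed 3 × 3 edge-detection operator is convolved with each mask to obtain its edge. A Laplacian kernel is used here as an illustrative choice, since the text does not name a specific operator:

```python
import torch
import torch.nn.functional as F

laplacian = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def mask_edge(mask: torch.Tensor) -> torch.Tensor:
    """mask: (N, 1, H, W) probability map; returns its edge response."""
    return F.conv2d(mask, laplacian, padding=1)

e_p = mask_edge(torch.rand(1, 1, 28, 28))                   # edge of the predicted mask, E_P
e_gt = mask_edge((torch.rand(1, 1, 28, 28) > 0.5).float())  # edge of GT_mask, E_GT
```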
S7) A loss function loss is constructed from the edge results $E_P$ and $E_{GT}$ obtained in step S6), measuring the discrepancy between the predicted edge and the reference edge.
the gait edge fitting degree of the pedestrian after passing through the edge detector is greatly improved, and the gap contour between the two legs can be detected.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, many modifications and variations can be made without departing from the spirit of the present invention, and these modifications and variations should also be considered within the scope of the present invention.
Claims (1)
1. A pedestrian gait segmentation method based on a deep neural network is characterized by comprising the following steps:
s1) predicting gait boundaries of pedestrians
Given a picture or a video, the gait boundaries of one or more pedestrians in it are predicted;
For a picture, all pedestrian targets in the single picture are detected and gait segmentation is performed on each target;
For a video, each frame is taken as input, all pedestrian targets in each frame are detected and segmented, and the processed frames are output and recombined into a segmented pedestrian gait video;
s2) image preprocessing and label making
Uniformly resize the segmented pedestrian gait images to h × w, where h is the image height and w is the image width;
Make the labels by processing the pixel values of the targets at the corresponding positions of the image: pixels with a value of 14 delineate the edges of the pedestrian positions, and pixels at non-pedestrian positions are uniformly set to 0 to represent the background;
s3) constructing a gait segmentation deep convolutional neural network
S3-1) extracting characteristics by adopting a basic network
A ResNet-50 network is used as the base network; on top of the ResNet-50 structure, the ordinary convolutions of its last stage are replaced by dilated (atrous) convolutions with a dilation rate of 2;
S3-2) The image preprocessed in step S2) is input into the base network of step S3-1); the resulting features are then fed into an FPN to further extract features at each scale. The FPN uses the bottom-up feature maps of different resolutions of the same image at each layer to efficiently generate a multi-scale feature representation of the picture;
S3-3) The features extracted in step S3-2) are passed through ROIAlign to generate ROI features of size 14 × 14 × 256; ROIAlign maps the candidate-box region proposals to fixed-size feature maps and uses bilinear interpolation to obtain more accurate pedestrian candidate boxes;
S3-4) The 14 × 14 × 256 feature map from step S3-3) is passed through 5 convolutions and then a deconvolution, transforming it into the pedestrian mask P_mask of size 28 × 28 × 1;
S3-5) A max pooling layer with kernel size 2 and stride 2 is applied to the 28 × 28 × 1 P_mask obtained in step S3-4), so that the predicted mask has the same spatial size as the output of step S3-3); the pooled mask is combined with that output to obtain a feature map of size 14 × 14 × 257;
This feature map is passed through 4 convolutional layers whose kernel size and number of filters are set to 3 and 256 respectively; 3 fully connected layers are then added, the first two with 1024 units and the last with as many units as there are categories, which here is 1 (pedestrian). The output value is the score of the mask; the threshold is set to 0.5, and masks with a score greater than 0.5 are adopted and defined as GT_mask;
S4) A loss function is constructed using the binary cross-entropy loss Binary_Cross_Entropy. The true probability is expressed as $p \in \{y, 1-y\}$ and the predicted probability as $q \in \{\hat{y}, 1-\hat{y}\}$, where $y$ denotes the probability that the sample belongs to a pedestrian, $1-y$ the probability that it belongs to the background, $\hat{y}$ the predicted pedestrian probability, and $1-\hat{y}$ the predicted background probability. The similarity between $p$ and $q$ is measured by the cross entropy:

$$H(p, q) = -\sum_i p_i \log q_i = -\big(y \log \hat{y} + (1-y)\log(1-\hat{y})\big)$$
S5) The binary cross-entropy loss Binary_Cross_Entropy is used to compare the information of each pixel in GT_mask and P_mask;
S6) The P_mask and GT_mask obtained in step S3) are input into an edge detector consisting of a single 3 × 3 edge-detection operator; each mask is convolved with the operator to obtain its edge. The edge result obtained from P_mask is defined as $E_P$, and the edge result obtained from GT_mask is defined as $E_{GT}$;
S7) A loss function loss is constructed from the edge results $E_P$ and $E_{GT}$ obtained in step S6), measuring the discrepancy between the predicted edge and the reference edge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911050215.3A CN111079516B (en) | 2019-10-31 | 2019-10-31 | Pedestrian gait segmentation method based on deep neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911050215.3A CN111079516B (en) | 2019-10-31 | 2019-10-31 | Pedestrian gait segmentation method based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111079516A CN111079516A (en) | 2020-04-28 |
CN111079516B (en) | 2022-12-20
Family
ID=70310602
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911050215.3A Active CN111079516B (en) | 2019-10-31 | 2019-10-31 | Pedestrian gait segmentation method based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111079516B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111898533B (en) * | 2020-07-30 | 2023-11-28 | 中国计量大学 | Gait classification method based on space-time feature fusion |
CN113160297B (en) * | 2021-04-25 | 2024-08-02 | Oppo广东移动通信有限公司 | Image depth estimation method and device, electronic equipment and computer readable storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110348445A (en) * | 2019-06-06 | 2019-10-18 | 华中科技大学 | A kind of example dividing method merging empty convolution sum marginal information |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016065534A1 (en) * | 2014-10-28 | 2016-05-06 | 中国科学院自动化研究所 | Deep learning-based gait recognition method |
- 2019-10-31: CN application CN201911050215.3A, patent CN111079516B — Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110348445A (en) * | 2019-06-06 | 2019-10-18 | 华中科技大学 | A kind of example dividing method merging empty convolution sum marginal information |
Non-Patent Citations (3)
Title |
---|
Rethinking Atrous Convolution for Semantic Image Segmentation; Liang-Chieh Chen et al.; arXiv; 2017-12-05; full text *
Research on Ship Target Detection Based on Mask R-CNN; 吴金亮 et al.; Radio Engineering (《无线电工程》); 2018-10-19 (No. 11); full text *
Crowd Counting Based on the Fusion of Deep Convolutional Networks and Dilated Convolution; 盛馨心 et al.; Journal of Shanghai Normal University (Natural Sciences) (《上海师范大学学报(自然科学版)》); 2019-10-15 (No. 5); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111079516A (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106845478B (en) | A kind of secondary licence plate recognition method and device of character confidence level | |
WO2019169816A1 (en) | Deep neural network for fine recognition of vehicle attributes, and training method thereof | |
CN107204006B (en) | Static target detection method based on double background difference | |
CN107316031A (en) | The image characteristic extracting method recognized again for pedestrian | |
CN109685045B (en) | Moving target video tracking method and system | |
CN102915544A (en) | Video image motion target extracting method based on pattern detection and color segmentation | |
CN101945257A (en) | Synthesis method for extracting chassis image of vehicle based on monitoring video content | |
CN111028263B (en) | Moving object segmentation method and system based on optical flow color clustering | |
CN111368742B (en) | Reconstruction and identification method and system of double yellow traffic marking lines based on video analysis | |
CN111079516B (en) | Pedestrian gait segmentation method based on deep neural network | |
CN105405138A (en) | Water surface target tracking method based on saliency detection | |
Bisio et al. | Traffic analysis through deep-learning-based image segmentation from UAV streaming | |
CN106951831B (en) | Pedestrian detection tracking method based on depth camera | |
CN109241932A (en) | A kind of thermal infrared human motion recognition method based on movement variogram phase property | |
Bailke et al. | Real-time moving vehicle counter system using opencv and python | |
Ouzounis et al. | Interactive collection of training samples from the max-tree structure | |
CN110570450B (en) | Target tracking method based on cascade context-aware framework | |
Kajatin et al. | Image segmentation of bricks in masonry wall using a fusion of machine learning algorithms | |
Chen et al. | Stingray detection of aerial images with region-based convolution neural network | |
CN110390283B (en) | Cross-camera pedestrian re-retrieval method in commercial scene | |
CN106603888A (en) | Image color extraction processing structure | |
CN118212572A (en) | Road damage detection method based on improvement YOLOv7 | |
Li et al. | Global anomaly detection in crowded scenes based on optical flow saliency | |
Wu et al. | Video surveillance object recognition based on shape and color features | |
Yuan et al. | Multi-scale deformable transformer encoder based single-stage pedestrian detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||