CN112257569A

CN112257569A - Target detection and identification method based on real-time video stream

Info

Publication number: CN112257569A
Application number: CN202011128268.5A
Authority: CN
Inventors: 段刚; 李衷怡
Original assignee: Qinghai City Cloud Big Data Technology Co ltd
Current assignee: Qinghai City Cloud Big Data Technology Co ltd
Priority date: 2020-10-21
Filing date: 2020-10-21
Publication date: 2021-01-22
Anticipated expiration: 2040-10-21
Also published as: CN112257569B

Abstract

The invention discloses a target detection and identification method based on real-time video stream, which comprises the following steps: s1, making a patrol plan according to the user requirements; s2, controlling the camera to rotate to a specified prefabrication position according to the inspection plan; s3, detecting whether a target exists in the current video frame sequence in the video stream acquired by the camera, executing the step S4 if the target exists, otherwise, delaying to wait for the current video frame detected by the video stream to be acquired continuously, and jumping to the step S2 to control the camera to rotate to the next specified prefabricated position if the target is not found after the delaying is finished; s4, controlling the camera to focus on the target area when finding the target; s5, intercepting the current area image and identifying the current target category; s6, outputting the detection and identification result; and S7, returning to the step S2 to continuously execute the inspection plan, and realizing real-time monitoring and identification of the moving target by the method, wherein the accuracy and the real-time performance are better.

Description

Target detection and identification method based on real-time video stream

Technical Field

The invention relates to the technical field of moving target detection and identification, in particular to a target detection and identification method based on real-time video streaming.

Background

The video-based real-time moving target Detection and identification is widely applied at present, and the moving target Detection (Motion Detection) refers to a process of taking an object with a space position change in an image sequence or a video as a foreground for presentation and marking. The method is always a very popular research field and is widely applied to the fields of intelligent monitoring, multimedia application and the like.

In the years, depending on different application occasions, technical methods and the like, scholars propose various moving object detection methods, which are suitable for the environment and the change of the scholars, and simultaneously have the detection accuracy and the real-time performance. At present, the following basic methods exist for computer vision inspection of moving objects: frame differencing, optical flow, and background subtraction, in addition to feature matching, KNN, and variations of these (three-frame differencing, five-frame differencing). The background subtraction algorithm is widely applied due to the characteristics of simple algorithm, easy realization, good adaptability and the like.

In some scenes with a large camera detection view field range, the imaging area of a moving target in a picture is small, noise interference is large, detection of the moving target is difficult to achieve, particularly, the false detection rate is high in a scene with a fuzzy background, the number of feature points is insufficient, and the type of the target is difficult to identify.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a target detection and identification method based on real-time video stream, which adopts a polling mode to decompose a scene with a larger view field range into video stream to realize effective detection of small targets in a large area;

in order to realize the purpose of the invention, the invention adopts the following technical scheme:

a target detection and identification method based on real-time video stream includes the following steps:

s1, making a patrol plan according to the user requirements;

s2, controlling the camera to rotate to a specified prefabrication position according to the inspection plan;

s3, detecting whether a target exists in the current video frame sequence in the video stream acquired by the camera, executing the step S4 if the target exists, otherwise, delaying to wait for the current video frame detected by the video stream to be acquired continuously, and jumping to the step S2 to control the camera to rotate to the next specified prefabricated position if the target is not found after the delaying is finished;

s4, controlling the camera to focus on the target area when finding the target;

s5, intercepting the current area image and identifying the current target category;

s6, outputting the detection and identification result;

and S7, returning to the step S2 to continuously execute the inspection plan.

Preferably, when step S3 is executed, after the sequence of video frames is obtained from the camera, the first frame is selected as a background frame, meanwhile, a background modeling is performed on the video frames by using median filtering, a background threshold is calculated based on the background frame, and finally, a moving object active region is quickly obtained by using a frame difference method; in step S4, the camera is adjusted in azimuth and focal length according to its relative position. The method integrates the accuracy of background subtraction and the rapid performance of a frame difference method; the purpose is to judge whether the target has an area where the target is quickly obtained.

Preferably, after the current region image is intercepted in step S5, the target class is identified by inputting the deep learning network model.

Preferably, the step of inputting the deep learning network model to identify the target category includes:

sa 1: inputting an image to be processed to a deep learning network model;

sa 2: the deep learning network maps the image to a high-dimensional feature space through initialization convolution:

sa 3: obtaining the feature information of each target in the image through a feature extraction network and a feature enhancement module; the feature extraction network extracts features of different levels, wherein shallow features are beneficial to small target detection, and deep features are beneficial to target identification;

sa 4: and predicting output, and obtaining target category and position information through classification and regression.

Preferably, the initialization convolution in step Sa2 includes convolution a of 3 × 3 × 1, convolution b of 3 × 3 × 2, batch normalization BN, and activation function Relu in this order; firstly, increasing the number of channels of an image to be processed through convolution a, performing downsampling through convolution b of 3 multiplied by 2 to obtain a characteristic diagram, then performing batch normalization BN (boron nitride) processing, and inputting as a next-stage network after an activation function Relu;

preferably, the feature extraction network in step Sa3 is composed of 10-30 residual convolution modules, and each of several residual convolutions is connected with a convolution b of 3 × 3 × 2 for down-sampling; each residual convolution module sequentially comprises 1 multiplied by 1 convolution c, batch normalization BN, activation function Relu, 3 multiplied by 1 convolution a, batch normalization BN and activation function Relu from input to output; the convolution b can change the size of the feature map, so that a feature map of a higher level is obtained, and feature information of the higher level is extracted through cascade of residual convolution.

Preferably, in step Sa3, the feature enhancement module performs fusion with the shallow features through a path aggregation network PAN after passing through spatial pyramid pooling SPP by using the deep features, and mainly functions to improve target detection and recognition accuracy, especially small target detection and recognition, through multi-level feature learning;

preferably, when the output is predicted specifically in step Sa4, the class confidence of the object and the coordinates of the object frame are obtained through the sofmax function; outputting offsets delta x and delta y specifically formed into a target frame, scaling scales a and b of an anchor point, the probability of detecting the target and the confidence coefficient that the target belongs to each category; according to the coordinates of the target frame, marking the position of the target on the original image, and displaying the predicted category confidence; and the probability of detecting the target is used for primarily screening the target frame.

Preferably, during prediction, prediction under three scales is carried out, the feature maps of the last three scales are taken, and after SPP and PAN are carried out, the features of the three scales are input into a detection module to be regressed and classified to obtain an output result; the detection module consists of 3 residual modules plus a convolution c with a fixed number of channels equal to (number of classes +5) × 3.

Preferably, when the probability of detecting the target is greater than a set threshold, wherein the threshold is set to be between 0.4 and 0.5, that is, the probability that the current pixel point belongs to the target to be detected is greater than a set value, the result is retained, the target frame is subjected to non-maximum value inhibition screening and deduplication, the target frame with the intersection ratio being maximum at IoU at the position is determined, and finally the target frame and the confidence thereof are output and displayed as a final result.

Compared with the prior art, the invention has the following beneficial effects:

1) the method realizes the detection and identification of the target in the large area based on the camera, can be matched with the inspection task to realize the real-time monitoring and identification of the moving target under the condition that the vision field range of the camera is large or the target imaging area is small, and has better accuracy and real-time performance.

2) The method adopts a continuous two-stage process, and firstly carries out motion detection on the monitoring video. When the target exists in the current video, the deep learning network detection and identification process of the second stage is started, and the hardware calculation cost is reduced.

3) The motion detection result provides a high-reliability target area for subsequent target detection and identification, so that the subsequent target detection and identification can rely on a lightweight network to realize high-precision result output, and the real-time performance is further improved.

4) The method adopts a frame difference method combined with median filtering to carry out background modeling, and improves the accuracy of motion detection.

5) In the small target detection and identification problem, the method combines the path aggregation network and the spatial pyramid pooling, and reserves enough rich small target characteristic information as much as possible in the convolution process.

Drawings

FIG. 1 is a flow chart of a real-time video stream based object detection and identification method of the present invention;

FIG. 2 is a flow chart of the method for detecting and identifying targets based on real-time video streaming, in which the image input deep learning network model identifies the target category.

FIG. 3 is a block diagram of a feature extraction network in a real-time video stream based target detection and identification method of the present invention;

FIG. 4 is a block diagram of a residual convolution module in a real-time video stream based target detection and identification method according to the present invention;

FIG. 5 is a block diagram of SPPs in a real-time video stream based object detection and recognition method of the present invention;

fig. 6 is a block diagram of a PAN in a real-time video stream based object detection and recognition method of the present invention.

Detailed Description

The technical solution in the embodiment of the present invention will be clearly and completely described below:

as shown in fig. 1, in an embodiment of the present invention, a method for detecting and identifying an object based on a real-time video stream includes the following steps:

s1, making a patrol plan according to the user requirements;

s4, controlling the camera to focus on the target area when finding the target;

s6, outputting the detection and identification result;

and S7, returning to the step S2 to continuously execute the inspection plan.

Specifically, when step S3 is executed, after a video frame sequence is obtained from a camera, a first frame is selected as a background frame, meanwhile, a background modeling is performed on the video frame by using median filtering, a background threshold is calculated based on the background frame, and finally, a moving object active region is quickly obtained by using a frame difference method; in step S4, the camera is adjusted in azimuth and focal length according to its relative position. The method integrates the accuracy of background subtraction and the rapid performance of a frame difference method; the purpose is to judge whether the target has an area where the target is quickly obtained.

Specifically, after the current area image is intercepted in step S5, the target category is identified by inputting the deep learning network model.

As shown in fig. 2, specifically, the step of identifying the target category by the input deep learning network model includes:

sa 1: inputting an image to be processed to a deep learning network model;

Specifically, the initialization convolution in step Sa2 sequentially includes convolution a of 3 × 3 × 1, convolution b of 3 × 3 × 2, batch normalization BN, and an activation function Relu; firstly, increasing the number of channels of an image to be processed through convolution a, performing downsampling through convolution b of 3 multiplied by 2 to obtain a characteristic diagram, then performing batch normalization BN (boron nitride) processing, and inputting as a next-stage network after an activation function Relu;

in specific implementation, the size of an image to be processed is H multiplied by W multiplied by 3 (three channels of length, width and RGB), the number of the channels of the image to be processed is increased to 32 by initialization convolution a, the image to be processed is changed into H multiplied by W multiplied by 32, downsampling is carried out through convolution b of 3 multiplied by 2 to obtain a characteristic diagram of H/2 multiplied by W/2 multiplied by 64, batch normalization BN processing is carried out, and the characteristic diagram is used as the input of a next-stage network after an activation function Relu;

specifically, the feature extraction network in step Sa3 is composed of 10-30 residual convolution (block) modules, and each of the residual convolutions is connected with a convolution b of 3 × 3 × 2 for down-sampling (as shown in fig. 3); each residual convolution module sequentially comprises 1 multiplied by 1 convolution c, batch normalization BN, activation function Relu, 3 multiplied by 1 convolution a, batch normalization BN and activation function Relu from input to output (as shown in FIG. 4); the convolution b can change the size of the feature map, so that a feature map of a higher level is obtained, and feature information of the higher level is extracted through cascade of residual convolution.

Specifically, in step Sa3, the feature enhancement module performs fusion with the shallow features through a path aggregation network PAN after the deep features pass through the spatial pyramid pooling SPP, and mainly functions to improve target detection and recognition accuracy, especially small target detection and recognition, through multi-level feature learning.

The structure of the SPP is shown in fig. 5, and the SPP operation is performed by dividing a feature map into block areas of different sizes, such as 4x4, 2x2, and 1x1 in fig. 5. Maximum pooling is performed for each zone. Therefore, under the three different division modes, one feature map can be represented by 16, 4 and 1 values respectively. After concatenating the values, a feature map can be represented as a vector of 21 values. When 256 feature maps are input, after the SPP operation, 21 × 256-dimensional vectors are obtained. SPP has two main roles: 1. the fixed-length feature vectors can be obtained by inputting with different sizes, so that subsequent full-connection layer operation is facilitated. 2. Multi-scale pooled feature information in a feature map can be fused in a feature vector. While at the same time the computation is less consuming.

The PAN performs feature fusion between feature maps of different levels (as shown in fig. 6). And the size of the feature map is gradually reduced through a down-sampling process. As shown in the following figures: characteristic diagram N of i-th layer_iDirectly obtaining a feature map with the same size as the i +1 th layer through down sampling, and performing feature extraction on the feature map and the i +1 th layer feature output P through a residual error module_i+1And performing serial fusion. To obtain N_i+1A feature map of the layer. Shallow feature information is preserved in the deep feature map as much as possible. The detection rate of the small target is improved. Meanwhile, deep features are subjected to up-sampling and are serially connected and fused with shallow feature output, and the classification precision of the target is favorably improved.

The specific implementation process is as follows, starting from initialization convolution, through feature network extraction, and then to the feature enhancement module:

wherein conv-BN-relu is complete convolution, conv is discrete convolution layer (representing convolution a, b, c process), and is defined as:

after passing through BN layer (batch normalization layer), normalizing the obtained result y to probability distribution with mean value of 0 and variance of 1

Then the final output of the BN layer is obtained through two super parameters

The operation formula is as follows.

The Relu layer is an activation layer with Relu as an activation function. An activation operation of the neuron is performed. Relu is defined as:

specifically, when the output is specifically predicted in step Sa4, the class confidence of the target and the coordinates of the target frame are obtained through the sofmax function; outputting offsets delta x and delta y specifically formed into a target frame, scaling scales a and b of an anchor point, the probability of detecting the target and the confidence coefficient that the target belongs to each category; according to the coordinates of the target frame, marking the position of the target on the original image, and displaying the predicted category confidence; and the probability of detecting the target is used for primarily screening the target frame.

During prediction, prediction under three scales is carried out, the feature maps of the last three scales are taken, and after SPP and PAN are carried out, the features of the three scales are input into a detection module to be regressed and classified to obtain an output result; the detection module is composed of 3 residual modules and a convolution c of fixed channel number (classification category number +5) × 3.

Specifically, when the probability of detecting the target is greater than a set threshold (generally set to be between 0.4 and 0.5), that is, the probability that the current pixel point belongs to the target to be detected is greater than a set value, the result is retained, a Non-Maximum Suppression (Non-Maximum Suppression) is performed on a target frame to screen and remove duplication, the target frame with the intersection ratio greater than IoU at the position is determined, and finally the target frame and the confidence coefficient thereof are output and displayed as a final result;

the above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art should be able to cover the technical scope of the present invention by equivalent replacement or change according to the technical solution and the modified concept of the present invention within the technical scope of the present invention.

Claims

1. A target detection and identification method based on real-time video stream is characterized in that: the method comprises the following steps:

s1, making a patrol plan according to the user requirements;

s4, controlling the camera to focus on the target area when finding the target;

s6, outputting the detection and identification result;

and S7, returning to the step S2 to continuously execute the inspection plan.

2. The method of claim 1, wherein the real-time video stream-based object detection and identification method comprises: when step S3 is executed, after a video frame sequence is obtained from a camera, a first frame is selected as a background frame, and meanwhile, a background modeling is performed on the video frame by using median filtering, and a background threshold is calculated therefrom, and finally, a moving object active region is quickly obtained by using a frame difference method; in step S4, the camera is adjusted in azimuth and focal length according to its relative position.

3. The method of claim 1, wherein the real-time video stream-based object detection and identification method comprises: after intercepting the current area image at step S5, the target category is identified by inputting the deep learning network model.

4. A method for real-time video stream based object detection and recognition as claimed in claim 3, wherein:

the step of identifying the target category by the input deep learning network model comprises the following steps:

sa 1: inputting an image to be processed to a deep learning network model;

sa 2: the deep learning network maps the image to a high-dimensional feature space through initialization convolution;

5. The method of claim 4, wherein the real-time video stream-based object detection and identification method comprises: the initialization convolution in step Sa2 sequentially includes convolution a of 3 × 3 × 1, convolution b of 3 × 3 × 2, batch normalization BN, and an activation function Relu; the image to be processed is firstly subjected to convolution a to increase the number of channels, is subjected to downsampling through convolution b of 3 multiplied by 2 to obtain a feature map, is subjected to batch normalization BN (boron nitride) processing, and is input as a next-stage network after being subjected to an activation function Relu.

6. The method of claim 4, wherein the real-time video stream-based object detection and identification method comprises: in the step Sa3, the feature extraction network is composed of 10-30 residual convolution modules, and each residual convolution module is connected with a convolution b of 3 × 3 × 2 for downsampling; each residual convolution module sequentially comprises 1 multiplied by 1 convolution c, batch normalization BN, activation function Relu, 3 multiplied by 1 convolution a, batch normalization BN and activation function Relu from input to output; the convolution b can change the size of the feature map, so that a feature map of a higher level is obtained, and feature information of the higher level is extracted through cascade of residual convolution.

7. The method of claim 4, wherein the real-time video stream-based object detection and identification method comprises: in step Sa3, the feature enhancement module performs fusion with the shallow features via a path aggregation network PAN after the deep features pass through spatial pyramid pooling SPP, and mainly functions to improve target detection and recognition accuracy, especially small target detection and recognition, through multi-level feature learning.

8. The method of claim 4, wherein the real-time video stream-based object detection and identification method comprises: when output is predicted specifically in the step Sa4, obtaining a category confidence of the target and coordinates of the target frame through a sofmax function; outputting offsets delta x and delta y specifically formed into a target frame, scaling scales a and b of an anchor point, the probability of detecting the target and the confidence coefficient that the target belongs to each category; according to the coordinates of the target frame, marking the position of the target on the original image, and displaying the predicted category confidence; and the probability of detecting the target is used for primarily screening the target frame.

9. The method of claim 8, wherein the real-time video stream-based object detection and identification method comprises: during prediction, prediction under three scales is carried out, the feature maps of the last three scales are taken, and after SPP and PAN are carried out, the features of the three scales are input into a detection module to be regressed and classified to obtain an output result; the detection module consists of 3 residual modules plus a convolution c with a fixed number of channels equal to (number of classes +5) × 3.

10. The method of claim 8, wherein the real-time video stream-based object detection and identification method comprises: and when the probability of detecting the target is greater than a set threshold, wherein the threshold is set to be 0.4-0.5, namely the probability that the current pixel point belongs to the target to be detected is greater than a set value, the result is retained, the target frame is subjected to non-maximum value inhibition screening and duplicate removal, the target frame with the intersection ratio being greater than IoU at the position is determined, and finally the target frame and the confidence thereof are output and displayed as a final result.