CN103605983A - Remnant detection and tracking method - Google Patents
Remnant detection and tracking method
- Publication number
- CN103605983A CN103605983A CN201310531106.XA CN201310531106A CN103605983A CN 103605983 A CN103605983 A CN 103605983A CN 201310531106 A CN201310531106 A CN 201310531106A CN 103605983 A CN103605983 A CN 103605983A
- Authority
- CN
- China
- Prior art keywords
- target
- frame
- video image
- image sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a method for detecting and tracking remnants (abandoned objects). The method comprises the following steps: preprocessing an original surveillance video image sequence by graying, filtering and the like to obtain an initial video image sequence; performing background modeling on the initial video image sequence collected by a camera, extracting a foreground region from the background modeling result, and denoising the foreground region to obtain foreground targets; training a support vector machine offline on positive and negative example pictures to obtain a target abandoned-object model and a human-body model respectively, inputting each foreground target into both models for judgment, and outputting the target abandoned object; tracking the target abandoned object with the Meanshift algorithm to obtain its position coordinates in each earlier frame; and traversing the initial video image sequence in reverse, tracking the position coordinates of the target abandoned object in each frame before the current frame, performing statistical analysis, and outputting the images of the initial video image sequence that contain the target abandoned object.
Description
Technical Field
The invention relates to the field of abandoned-object detection, and in particular to an abandoned-object detection and tracking method.
Background
An abandoned object is an unattended static object. In intelligent video surveillance systems, the detection of objects left behind has wide application, for example the real-time monitoring of suspicious objects or lost luggage in open places such as buildings, squares and military control areas [1]. Surveillance video is important for obtaining evidence after an abnormal event: if monitoring personnel fail to notice an abnormal situation as it happens, the circumstances can only be established afterwards by replaying the video. However, a monitoring mode that relies mainly on human vigilance and after-the-fact video playback not only wastes considerable manpower and material resources, but also frequently misses abnormal events, and cannot solve the security problems that are widespread today.
In an intelligent video surveillance system, abandoned-object detection builds on digital image processing, digital video processing, computer vision, pattern recognition and related technologies, and analyzes the data in the surveillance video by computer. Abandoned objects in public places can then be detected and tracked automatically; when an emergency happens, security personnel can carry out targeted inspection according to the alarms raised in real time by the intelligent video surveillance system, so that security work achieves accurate target positioning, high speed and strong pertinence.
Most existing detection methods simply declare any foreground region that stays stationary long enough to be an abandoned object, without considering whether that region is a stationary pedestrian or an object; they also ignore the relationship between people and objects, which easily causes false detections.
Therefore, how to further distinguish whether a foreground region is a human body or an object, judge the relationship between the two, and reduce the false detection rate is an urgent problem in abandoned-object detection.
Disclosure of Invention
The invention provides a method for detecting and tracking abandoned objects which reduces false reports of abnormal events and thereby the false detection rate, as described in detail below:
an abandoned-object detection and tracking method, the method comprising the steps of:
(1) preprocessing an original surveillance video image sequence by graying, filtering and the like to obtain an initial video image sequence;
(2) performing background modeling on the initial video image sequence acquired by a camera, extracting a foreground region from the background modeling result, and denoising the foreground region to obtain foreground targets F_q;
(3) training a support vector machine offline on positive and negative example pictures to obtain a target abandoned-object model M and a human-body model N respectively, inputting each foreground target F_q obtained in step (2) into both models for judgment, and outputting the target abandoned object P;
(4) tracking the target abandoned object P with the Meanshift algorithm, and acquiring the position coordinates of P in each earlier frame;
(5) traversing the initial video image sequence in reverse, tracking the position coordinates of the target abandoned object P in each frame before the current frame, performing statistical analysis, and outputting the images of the initial video image sequence that contain P.
Inputting each foreground target F_q obtained in step (2) into the two models for judgment and outputting the target abandoned object P specifically comprises:
1) if the human-body model N outputs 1 and the target-object model M outputs 1, the target-type object in the scene is attended by a person and is not abandoned, and step (4) is not executed;
2) if the human-body model N outputs 1 and the target-object model M outputs 0, there is no target-type object of interest in the scene, and step (4) is not executed;
3) if the human-body model N outputs 0 and the target-object model M outputs 0, there is no target-type object of interest in the scene, and step (4) is not executed;
4) if the human-body model N outputs 0 and the target-object model M outputs 1, the target-type object in the scene is unattended and is taken as the target abandoned object P, and the method proceeds to step (4).
Tracking the target abandoned object P with the Meanshift algorithm and acquiring its position coordinates in each earlier frame specifically comprises:
1) color space conversion;
2) sampling the hue value of each pixel in the circumscribed rectangle of the target abandoned object P to obtain its color histogram H(n), and computing from it the back projection t(n) of the circumscribed rectangle; n is the horizontal-axis coordinate of the color histogram (a pixel of value n in the image region), and H(n), the vertical axis, is the count of pixels with value n;
3) calculating the position coordinates of the target abandoned object P in the previous frame:
in the k-th frame, the coordinates of all pixels within the circumscribed rectangle are weighted by the back-projection values t_k of the corresponding points and summed, giving the coordinates (x_0, y_0) of P in the previous frame:
x_0 = Σ x · t_k(n_{x,y}) / Σ t_k(n_{x,y}), y_0 = Σ y · t_k(n_{x,y}) / Σ t_k(n_{x,y}),
where x and y are the x-axis and y-axis coordinates in HSV space of all pixels in the circumscribed rectangle, and t_k(n_{x,y}) is the back-projection value at coordinates (x, y);
the distance between the results of two successive iterations, d = sqrt((x' - x_0)² + (y' - y_0)²), is then calculated; if d ≤ T_2 the iteration ends, and the position coordinates of the target abandoned object P in the previous frame are (x_0, y_0).
Traversing the initial video image sequence in reverse, tracking the position coordinates of the target abandoned object P in each frame before the current frame, performing statistical analysis, and outputting the images of the initial video image sequence that contain P specifically comprises:
traversing the initial video image sequence backward from the moment (frame k) at which the target abandoned object P was found, and tracking and recording its position coordinates;
1) if at the m-th frame (1 ≤ m ≤ k) the condition
p_m(x) ≤ T_3 or X - p_m(x) ≤ T_3
is satisfied, take m' = m_min, where m' is the smallest frame number satisfying the condition; the target abandoned object P is considered to enter the scene for the first time at frame m'; p_m(x) is the x coordinate of P at the m-th frame, and X is the width in pixels of the initial video image sequence;
an image frame containing the target abandoned object P is output from the initial video image sequence every A frames starting from frame m', i.e. the set of output image frames is
O = { f_(m'+iA) | i = 0, 1, 2, …, and m' + iA ≤ k };
2) otherwise, an image frame containing the target abandoned object P is output every A frames starting from frame 1 of the initial video image sequence, i.e. the set of output image frames is
O = { f_(1+iA) | i = 0, 1, 2, …, and 1 + iA ≤ k }.
the technical scheme provided by the invention has the beneficial effects that: the method comprises the steps of obtaining a foreground target through a background difference method, judging a target object to be left by using an offline-trained support vector machine model, obtaining position coordinates of each frame before a current frame, finding the moment when the target object to be left enters a scene through statistical analysis, and outputting an image frame to remind security personnel of paying attention; the method reduces the missing report of the abnormal events, reduces the false detection rate of the remnants, and improves the working efficiency of the monitoring equipment and the security personnel.
Drawings
FIG. 1 is a flow chart of the sequential-logic-based abandoned-object detection method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
To reduce computational complexity and cost, detect abandoned objects in real time, and lower the false detection rate, an embodiment of the present invention provides an abandoned-object detection method based on sequential logic; see FIG. 1 and the detailed description below:
101: preprocess the original surveillance video image sequence by graying, filtering and the like to obtain an initial video image sequence;
The embodiment of the invention first grays the original surveillance video image sequence and then applies the Gaussian filtering method of reference [2] for further processing, obtaining the initial video image sequence.
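A minimal sketch of this preprocessing step in Python with OpenCV follows; the 5 × 5 kernel size is an illustrative assumption, since the text fixes only the two operations (graying, then Gaussian filtering [2]).

```python
import cv2

def preprocess(frame_bgr):
    """Gray the frame, then smooth it with a Gaussian filter."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)  # graying
    # The 5x5 kernel is an assumed value; sigma=0 lets OpenCV derive it.
    return cv2.GaussianBlur(gray, (5, 5), 0)            # Gaussian filtering
```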
102: perform background modeling on the initial video image sequence acquired by the camera, extract a foreground region from the background modeling result, and denoise the foreground region to obtain foreground targets F_q;
Since HOG (histogram of oriented gradients) detection traverses every frame, a foreground region is first extracted from each frame of the surveillance video. Background modeling and foreground extraction can use common methods such as background differencing, optical flow, or inter-frame differencing. The inter-frame difference method [3] generally cannot extract all relevant pixels completely, and holes easily appear inside moving entities. The optical flow method [4] is computationally complex and has poor noise resistance. Because the environment studied here is close to ideal and the algorithm places only modest demands on background updating (the background needs to be refreshed only rarely), the foreground target is extracted by background differencing. This method is simple to implement, fast, and suitable for a static camera; it requires a static background image of the current scene. The specific steps are as follows:
1) obtain a static background image b of the current scene that contains no target objects;
2) compute the difference between the current frame (i.e. the k-th frame) image f_k(x, y) and the background image b to obtain the difference image D_k(x, y):
D_k(x, y) = |f_k(x, y) - b|.
3) binarize the difference image D_k(x, y) to obtain the binary image R_k(x, y):
R_k(x, y) = 1 if D_k(x, y) > T_1, and R_k(x, y) = 0 otherwise,
where the threshold T_1 can be set according to the actual conditions; T_1 = 25 in the experiment. In particular, the embodiment of the present invention is not limited to this value.
4) to remove the holes and burrs that appear in the foreground region, the binary image R_k(x, y) is processed with the morphological method of reference [5] to eliminate noise interference.
That is, morphological filtering is applied to the binary image R_k(x, y) to eliminate isolated noise points and repair holes inside the target region. Foreground targets F_q (q = 1, 2, …, Q, where Q is the total number of segmented foreground targets) are finally detected and segmented by connectivity analysis; the circumscribed rectangle U_q of each F_q is extracted and uniformly resized to 64 × 128 pixels so that feature vectors can be extracted later.
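A sketch of steps 1)–4) in Python with OpenCV is given below; the 3 × 3 structuring element and the opening/closing pair stand in for the morphological method of reference [5], whose exact operations the text does not spell out.

```python
import cv2
import numpy as np

T1 = 25  # binarization threshold used in the experiment

def extract_foreground_targets(frame_gray, background_gray):
    """Background differencing, binarization, morphological cleanup,
    then connectivity analysis to cut out 64x128 candidate patches."""
    diff = cv2.absdiff(frame_gray, background_gray)              # D_k(x,y)
    _, binary = cv2.threshold(diff, T1, 255, cv2.THRESH_BINARY)  # R_k(x,y)
    kernel = np.ones((3, 3), np.uint8)                           # assumed element
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)    # drop noise points
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)   # repair holes
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)      # connectivity
    patches = []
    for c in contours:                                           # one F_q per blob
        x, y, w, h = cv2.boundingRect(c)                         # rectangle U_q
        patches.append(cv2.resize(frame_gray[y:y + h, x:x + w], (64, 128)))
    return patches
```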
103: train the support vector machine offline on positive and negative example pictures to obtain the target abandoned-object model M and the human-body model N respectively, input each foreground target F_q obtained in step 102 into both models for judgment, and output the target abandoned object P;
To determine whether a foreground target F_q extracted in step 102 is a human body or a target-type abandoned object (the method takes a suitcase, travel bag, backpack or case as the example target object; other object types can be configured in a specific implementation, which the embodiment of the invention does not limit), N_1 luggage-type pictures are selected from web images and personal photographs as positive examples, and N_2 package-free regions of the surveillance video are selected as negative examples; similarly, M_1 human-body pictures are selected from web images and personal photographs as positive examples, and M_2 human-free regions of the surveillance video as negative examples. All pictures are resized uniformly to 64 × 128 pixels, and the HOG features of the positive and negative example pictures are extracted.
The HOG feature is a local region descriptor formed by computing a histogram of gradient orientations over a local region; it is insensitive to illumination changes and small offsets. It is extracted as follows:
1) gradient calculation:
let the gray value at pixel (x, y) of the input image be I(x, y); the gradients in the x and y directions are
G_x(x, y) = |I(x + 1, y) - I(x - 1, y)|
G_y(x, y) = |I(x, y + 1) - I(x, y - 1)|
where G_x(x, y) and G_y(x, y) are the horizontal and vertical gradients at pixel (x, y), and I(x + 1, y), I(x - 1, y), I(x, y + 1), I(x, y - 1) are the gray values at the corresponding neighboring pixels.
The gradient magnitude G(x, y) and gradient direction α(x, y) at pixel (x, y) are
G(x, y) = sqrt(G_x(x, y)² + G_y(x, y)²)
α(x, y) = arctan(G_y(x, y) / G_x(x, y)).
2) the gradient direction range (0° to 180°) is divided evenly into 9 bins, with one histogram channel per bin; the input image is divided into cells of 8 × 8 pixels, and the gradient magnitudes of all pixels in a cell are accumulated into the histogram channel indicated by their gradient direction, yielding a 9-dimensional feature vector per cell;
3) every 4 adjacent cells are combined into a block, and the feature vectors of the 4 cells are concatenated into the 36-dimensional feature vector of the block; each block is normalized separately by the rule
v ← v / (||v||_1 + ε),
where v is the vector to be normalized, ||v||_1 is its L1 norm, and ε is a small constant; ε = 0.04 in this experiment. The value of ε can be chosen according to the actual conditions, and the embodiment of the present invention does not limit it.
4) the image is scanned block by block with a scanning step of one cell. Finally the normalized features of all blocks are concatenated to form the HOG feature vector of the input image, from which the target abandoned-object model M and the human-body model N are obtained.
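For reference, OpenCV's default HOGDescriptor already matches the layout described above (64 × 128 window, 8 × 8 cells, 2 × 2-cell blocks, one-cell stride, 9 bins), although it applies L2-Hys block normalization rather than the L1 rule of step 3); the sketch below uses it as a stand-in.

```python
import cv2

# Defaults: winSize=(64,128), blockSize=(16,16), blockStride=(8,8),
# cellSize=(8,8), nbins=9 -- the geometry described in steps 2)-4).
hog = cv2.HOGDescriptor()

def hog_feature(patch_64x128_gray):
    """HOG feature vector of a 64x128 patch (7*15 blocks * 36 = 3780 dims)."""
    return hog.compute(patch_64x128_gray).ravel()
```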
The target abandoned-object model M and the human-body model N are trained offline with the support vector machine of reference [6]: the HOG feature vector of each positive example picture is labeled 1, that of each negative example picture is labeled 0, and the labeled vectors are fed to the support vector machine for training, yielding the models M and N.
The numbers of positive and negative example pictures are set according to the practical application: with more pictures the trained support vector machine is more accurate but training takes longer; the embodiment of the invention does not limit these numbers. In the experiment, N_1 = M_1 = 500 and N_2 = M_2 = 800.
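A hypothetical training sketch with OpenCV's SVM module follows; the linear kernel is an assumption, since the text does not state which kernel the machine of reference [6] uses.

```python
import cv2
import numpy as np

def train_model(pos_patches, neg_patches):
    """Offline-train one SVM on HOG features: positives labeled 1, negatives 0."""
    feats = np.array([hog_feature(p) for p in pos_patches + neg_patches],
                     dtype=np.float32)
    labels = np.array([1] * len(pos_patches) + [0] * len(neg_patches),
                      dtype=np.int32)
    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_LINEAR)  # assumed kernel choice
    svm.train(feats, cv2.ml.ROW_SAMPLE, labels)
    return svm

# model_M = train_model(parcel_pics, no_parcel_pics)  # target abandoned-object model
# model_N = train_model(human_pics, no_human_pics)    # human-body model
```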
The circumscribed rectangle U_q of each foreground target F_q is input in turn to the target abandoned-object model M and the human-body model N for judgment (the experiment considers only the case of a single target object in the scene; if there are several, the following judgments are made in sequence), as sketched after the following list.
1) If the human-body model N outputs 1 and the target-object model M outputs 1, the target-type object in the scene is attended by a person and is not abandoned; step 104 is not executed.
2) If the human-body model N outputs 1 and the target-object model M outputs 0, there is no target-type object of interest in the scene; step 104 is not executed.
3) If the human-body model N outputs 0 and the target-object model M outputs 0, there is no target-type object of interest in the scene; step 104 is not executed.
4) If the human-body model N outputs 0 and the target-object model M outputs 1, the target-type object in the scene is unattended and is the target abandoned object P; the method proceeds to step 104.
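The four cases collapse to a single test, as this sketch (continuing the hypothetical helpers above) shows: only the combination N = 0, M = 1 yields an abandoned object.

```python
def is_abandoned(model_M, model_N, patch):
    """True only for case 4): no person (N=0) but a target object (M=1)."""
    feat = hog_feature(patch).reshape(1, -1)
    m = int(model_M.predict(feat)[1][0][0])  # target-object model output
    n = int(model_N.predict(feat)[1][0][0])  # human-body model output
    return n == 0 and m == 1
```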
104: track the target abandoned object P obtained in step 103 with the Meanshift algorithm, and acquire the position coordinates of P in each earlier frame;
The Meanshift algorithm is a non-parametric estimation method based on density gradients; its core is an iterative process. After the target abandoned object P is detected in step 103, its position coordinates in each earlier image frame are computed iteratively with Meanshift until the distance between the coordinates of two successive iterations falls below a threshold or the iteration count exceeds a limit. The specific steps are as follows:
1) color space conversion:
set coordinates (x', y') to store the result of the previous iteration, initialized to the position coordinates of the target abandoned object P, and initialize the iteration count L = 0; the method caps L at 8, although in a specific implementation the embodiment of the invention does not limit this value.
In order to eliminate the influence of illumination variation and shadow, the input video image in RGB color space is converted into HSV color space, wherein hue H contains the most essential information of color and is independent of brightness.
The formula for converting the RGB color space image into the HSV color space image is as follows:
V=max(R,G,B)
H=H+360,if H<0
wherein H represents hue, S represents saturation, and V represents brightness; r, G, B denote red, green and blue pixels, respectively. And converting each pixel point in the external rectangle of the target legacy object P from the RGB color space to the HSV color space by the formula.
2) The hue value of each pixel in the circumscribed rectangle of the target abandoned object P is sampled to obtain its color histogram H(n), from which the back projection t(n) of the circumscribed rectangle is computed; n is the horizontal-axis coordinate of the color histogram (a pixel of value n in the image region), and H(n), the vertical axis, is the count of pixels with value n.
3) Calculate the position coordinates of the target abandoned object P in the previous frame:
in the k-th frame of the video sequence, the coordinates of all pixels inside the circumscribed rectangle are weighted by the back-projection values t_k of the corresponding points and summed, giving the coordinates (x_0, y_0) of P in the previous frame:
x_0 = Σ x · t_k(n_{x,y}) / Σ t_k(n_{x,y}), y_0 = Σ y · t_k(n_{x,y}) / Σ t_k(n_{x,y}),
where x and y are the x-axis and y-axis coordinates in HSV space of all pixels in the circumscribed rectangle, and t_k(n_{x,y}) is the back-projection value at coordinates (x, y).
The distance between the results of two successive iterations, d = sqrt((x' - x_0)² + (y' - y_0)²), is then calculated. If d ≤ T_2 (the threshold T_2 can be set according to the actual conditions, for example T_2 = 3; the embodiment of the invention does not limit it in a specific implementation), the iteration ends and the position coordinates of P in the previous frame are (x_0, y_0). Otherwise (d > T_2), the center coordinates (x', y') of the circumscribed rectangle are updated to (x_0, y_0), giving a new circumscribed rectangle; the iteration count is increased, L = L + 1, and the method returns to step 2) to recompute the back projection and then repeats the calculation of step 3). The termination condition is L ≥ 8: when the iteration count reaches 8, the loop exits and the procedure ends.
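Steps 1)–3) can be sketched as follows with OpenCV's histogram back-projection; the manual centroid loop mirrors the iteration above, while keeping the rectangle size fixed is an assumption the text neither confirms nor rules out.

```python
import cv2
import numpy as np

T2, MAX_ITER = 3, 8  # distance threshold and iteration cap from the text

def locate_in_previous_frame(prev_frame_bgr, rect):
    """One Meanshift localization of P; rect = (x, y, w, h) from the later frame."""
    hsv = cv2.cvtColor(prev_frame_bgr, cv2.COLOR_BGR2HSV)        # 1) to HSV
    x, y, w, h = rect
    hue_hist = cv2.calcHist([hsv[y:y + h, x:x + w]], [0], None,
                            [180], [0, 180])                     # 2) hue histogram
    t_k = cv2.calcBackProject([hsv], [0], hue_hist, [0, 180], 1) # back projection
    for _ in range(MAX_ITER):
        win = t_k[y:y + h, x:x + w].astype(np.float64)
        if win.sum() == 0:
            break
        ys, xs = np.indices(win.shape)                           # 3) weighted centroid
        x0 = x + (xs * win).sum() / win.sum()
        y0 = y + (ys * win).sum() / win.sum()
        d = np.hypot(x0 - (x + w / 2), y0 - (y + h / 2))         # shift distance
        x = int(np.clip(x0 - w / 2, 0, t_k.shape[1] - w))        # recenter rectangle
        y = int(np.clip(y0 - h / 2, 0, t_k.shape[0] - h))
        if d <= T2:                                              # converged
            break
    return x, y, w, h
```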
105: traverse the initial video image sequence in reverse, track the position coordinates of the target abandoned object P in each frame before the current frame, perform statistical analysis, output the images of the initial video image sequence that contain P, and alert security personnel.
The initial video image sequence is traversed backward from the moment (frame k) at which the target abandoned object P was found, and its position coordinates are tracked and recorded.
1) If at the m-th frame (1 ≤ m ≤ k) the condition
p_m(x) ≤ T_3 or X - p_m(x) ≤ T_3
is satisfied, take m' = m_min, where m' is the smallest frame number satisfying the condition; the target abandoned object P is considered to enter the scene for the first time at frame m'. Here p_m(x) is the x coordinate of P at the m-th frame, and X is the width in pixels of the initial video image sequence. T_3 = 10 in this experiment; the threshold T_3 can be set according to the actual conditions, and the embodiment of the invention does not limit it.
Image frames containing the target abandoned object P are then output from the initial video image sequence every A frames starting from frame m', i.e. the set of output image frames is
O = { f_(m'+iA) | i = 0, 1, 2, …, and m' + iA ≤ k },
roughly ⌈(k - m')/A⌉ frames in total, where ⌈·⌉ denotes rounding up. A = 15 in this experiment; the value of A can be chosen according to the actual conditions, and the embodiment of the invention does not limit it in a specific implementation.
2) Otherwise, an image frame containing the target abandoned object P is output every A frames starting from frame 1 of the initial video image sequence, i.e. the set of output image frames is
O = { f_(1+iA) | i = 0, 1, 2, …, and 1 + iA ≤ k }.
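Both output sets reduce to one arithmetic progression over frame indices; a small sketch:

```python
def output_frame_indices(k, A=15, m_prime=None):
    """Frame indices to output: every A-th frame, from m' (if P's first entry
    was found) or else from frame 1, up to the detection frame k."""
    start = m_prime if m_prime is not None else 1
    return list(range(start, k + 1, A))  # O = {start, start+A, ...} up to k
```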
reference to the literature
[1] K. Smith, P. Quelhas, D. Gatica-Perez. Detecting abandoned luggage items in a public space. Proceedings of the 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS'06), 2006: 75-82.
[2] H. C. Lin, L. L. Wang, S. N. Yang. Automatic determination of the spread parameter in Gaussian smoothing. Pattern Recognition Letters, 1996, 17(12): 1247-1252.
[3] J. Abdi, M. A. Nekoui. Determined prediction of nonlinear time series via emotional temporal difference learning. Control and Decision Conference, 2008.
[4] M. Ahmad, T. Taslima, L. Lata, et al. A combined local-global optical flow approach for cranial ultrasonogram image sequence analysis. 11th International Conference on Computer and Information Technology, 2008.
[5] M. L. Comer, E. J. Delp. Morphological operations for color image processing. Journal of Electronic Imaging, 1999, 8(3): 279-289.
[6] H. Zhang, et al. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006, Vol. 2.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (4)
1. An abandoned-object detection and tracking method, the method comprising the steps of:
(1) preprocessing an original surveillance video image sequence by graying, filtering and the like to obtain an initial video image sequence;
(2) performing background modeling on the initial video image sequence acquired by a camera, extracting a foreground region from the background modeling result, and denoising the foreground region to obtain foreground targets F_q;
(3) training a support vector machine offline on positive and negative example pictures to obtain a target abandoned-object model M and a human-body model N respectively, inputting each foreground target F_q obtained in step (2) into both models for judgment, and outputting the target abandoned object P;
(4) tracking the target abandoned object P with the Meanshift algorithm, and acquiring the position coordinates of P in each earlier frame;
(5) traversing the initial video image sequence in reverse, tracking the position coordinates of the target abandoned object P in each frame before the current frame, performing statistical analysis, and outputting the images of the initial video image sequence that contain P.
2. The abandoned-object detection and tracking method according to claim 1, wherein inputting each foreground target F_q obtained in step (2) into the two models for judgment and outputting the target abandoned object P specifically comprises:
1) if the human-body model N outputs 1 and the target-object model M outputs 1, the target-type object in the scene is attended by a person and is not abandoned, and step (4) is not executed;
2) if the human-body model N outputs 1 and the target-object model M outputs 0, there is no target-type object of interest in the scene, and step (4) is not executed;
3) if the human-body model N outputs 0 and the target-object model M outputs 0, there is no target-type object of interest in the scene, and step (4) is not executed;
4) if the human-body model N outputs 0 and the target-object model M outputs 1, the target-type object in the scene is unattended and is taken as the target abandoned object P, and the method proceeds to step (4).
3. The method according to claim 1, wherein tracking the target abandoned object P with the Meanshift algorithm and acquiring its position coordinates in each earlier frame specifically comprises:
1) color space conversion;
2) sampling the hue value of each pixel in the circumscribed rectangle of the target abandoned object P to obtain its color histogram H(n), and computing from it the back projection t(n) of the circumscribed rectangle; n is the horizontal-axis coordinate of the color histogram (a pixel of value n in the image region), and H(n), the vertical axis, is the count of pixels with value n;
3) calculating the position coordinates of the target abandoned object P in the previous frame:
in the k-th frame, the coordinates of all pixels within the circumscribed rectangle are weighted by the back-projection values t_k of the corresponding points and summed, giving the coordinates (x_0, y_0) of P in the previous frame:
x_0 = Σ x · t_k(n_{x,y}) / Σ t_k(n_{x,y}), y_0 = Σ y · t_k(n_{x,y}) / Σ t_k(n_{x,y}),
where x and y are the x-axis and y-axis coordinates in HSV space of all pixels in the circumscribed rectangle, and t_k(n_{x,y}) is the back-projection value at coordinates (x, y).
4. The method according to claim 1, wherein traversing the initial video image sequence in reverse, tracking the position coordinates of the target abandoned object P in each frame before the current frame, performing statistical analysis, and outputting the images of the initial video image sequence that contain P specifically comprises:
traversing the initial video image sequence backward from the moment at which the target abandoned object P was found, and tracking and recording its position coordinates;
1) if at the m-th frame (1 ≤ m ≤ k) the condition
p_m(x) ≤ T_3 or X - p_m(x) ≤ T_3
is satisfied, take m' = m_min, where m' is the smallest frame number satisfying the condition; the target abandoned object P is considered to enter the scene for the first time at frame m'; p_m(x) is the x coordinate of P at the m-th frame, and X is the width in pixels of the initial video image sequence;
an image frame containing the target abandoned object P is output from the initial video image sequence every A frames starting from frame m', i.e. the set of output image frames is
O = { f_(m'+iA) | i = 0, 1, 2, …, and m' + iA ≤ k };
2) otherwise, an image frame containing the target abandoned object P is output every A frames starting from frame 1 of the initial video image sequence, i.e. the set of output image frames is
O = { f_(1+iA) | i = 0, 1, 2, …, and 1 + iA ≤ k }.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310531106.XA CN103605983B (en) | 2013-10-30 | 2013-10-30 | Remnant detection and tracking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103605983A true CN103605983A (en) | 2014-02-26 |
CN103605983B CN103605983B (en) | 2017-01-25 |
Family
ID=50124203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310531106.XA Active CN103605983B (en) | 2013-10-30 | 2013-10-30 | Remnant detection and tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103605983B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101231696A (en) * | 2008-01-30 | 2008-07-30 | 安防科技(中国)有限公司 | Method and system for detection of hangover |
US20100128930A1 (en) * | 2008-11-24 | 2010-05-27 | Canon Kabushiki Kaisha | Detection of abandoned and vanished objects |
CN101552910A (en) * | 2009-03-30 | 2009-10-07 | 浙江工业大学 | Lave detection device based on comprehensive computer vision |
CN103324906A (en) * | 2012-03-21 | 2013-09-25 | 日电(中国)有限公司 | Method and equipment for detecting abandoned object |
CN103226701A (en) * | 2013-04-24 | 2013-07-31 | 天津大学 | Modeling method of video semantic event |
Non-Patent Citations (1)
Title |
---|
Fu Jiyong (富吉勇): "Research on the detection of abandoned objects and the persons who placed them based on omnidirectional vision", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108476304B (en) * | 2016-01-25 | 2020-08-11 | 松下知识产权经营株式会社 | Discarded object monitoring device, discarded object monitoring system provided with same, and discarded object monitoring method |
CN108476304A (en) * | 2016-01-25 | 2018-08-31 | 松下知识产权经营株式会社 | It abandons object monitoring device and has the discarding article surveillance system of the discarding object monitoring device and abandon article surveillance method |
CN106204650A (en) * | 2016-07-11 | 2016-12-07 | 北京航空航天大学 | A kind of vehicle target tracking based on vacant lot video corresponding technology |
CN106128023A (en) * | 2016-07-18 | 2016-11-16 | 四川君逸数码科技股份有限公司 | A kind of wisdom gold eyeball identification foreign body leaves over alarm method and device |
CN106650638A (en) * | 2016-12-05 | 2017-05-10 | 成都通甲优博科技有限责任公司 | Abandoned object detection method |
CN110648352A (en) * | 2018-06-26 | 2020-01-03 | 杭州海康威视数字技术股份有限公司 | Abnormal event detection method and device and electronic equipment |
CN109636795A (en) * | 2018-12-19 | 2019-04-16 | 安徽大学 | Monitor video remnant object detection method without tracking in real time |
CN109636795B (en) * | 2018-12-19 | 2022-12-09 | 安徽大学 | Real-time non-tracking monitoring video remnant detection method |
CN110321808B (en) * | 2019-06-13 | 2021-09-14 | 浙江大华技术股份有限公司 | Method, apparatus and storage medium for detecting carry-over and stolen object |
CN110321808A (en) * | 2019-06-13 | 2019-10-11 | 浙江大华技术股份有限公司 | Residue and robber move object detecting method, equipment and storage medium |
CN110751107A (en) * | 2019-10-23 | 2020-02-04 | 北京精英系统科技有限公司 | Method for detecting event of discarding articles by personnel |
CN111415347A (en) * | 2020-03-25 | 2020-07-14 | 上海商汤临港智能科技有限公司 | Legacy object detection method and device and vehicle |
CN111415347B (en) * | 2020-03-25 | 2024-04-16 | 上海商汤临港智能科技有限公司 | Method and device for detecting legacy object and vehicle |
CN114639113A (en) * | 2020-11-30 | 2022-06-17 | 风林科技(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium |
CN113780231A (en) * | 2021-09-22 | 2021-12-10 | 国网内蒙古东部电力有限公司信息通信分公司 | Legacy tool detection method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN103605983B (en) | 2017-01-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |