CN110782483A - Multi-view multi-target tracking method and system based on distributed camera network
- Publication number
- CN110782483A (application CN201911012807.6A)
- Authority
- CN
- China
- Prior art keywords
- detected target
- information
- current frame
- camera
- frame view
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/292: Multi-camera tracking
- G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/53: Recognition of crowd images, e.g. recognition of crowd congestion
- H04N7/181: Closed-circuit television [CCTV] systems for receiving images from a plurality of remote sources
- H04N7/188: Capturing isolated or intermittent images triggered by the occurrence of a predetermined event, e.g. an object reaching a predetermined position
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30241: Trajectory
- G06V10/255: Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
- G06V10/95: Hardware or software architectures specially adapted for image or video understanding, structured as a network, e.g. client-server architectures
- G06V40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The disclosure provides a multi-view multi-target tracking method and system based on a distributed camera network. The method comprises: acquiring the current frame view captured by each camera in the distributed camera network; extracting a rectangular bounding box of each detected target from the current frame view; extracting visual appearance information of the detected target from the image inside the rectangular bounding box with a pre-trained convolutional neural network; converting the image coordinates of the detected target in the current frame view into ground coordinates; and an output step: constructing a data association matrix from the visual appearance information and the ground coordinates of the detected target, processing the data association matrix with the Hungarian algorithm, and outputting whether each detected target was successfully matched to a known track in the current frame view.
Description
Technical Field
The disclosure relates to the technical field of multi-target tracking, in particular to a multi-view multi-target tracking method and system based on a distributed camera network.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Multi-object tracking (MOT) technology has many applications in today's society, such as monitoring, surveillance, and crowd behavior analysis.
In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:
Multi-target tracking remains a challenging task because it requires simultaneously solving the problems of target detection, trajectory estimation, data association, and re-identification. To detect targets, various sensors such as radar, laser, sonar, and cameras can be used according to the requirements of the specific task, each equipped with a corresponding detection algorithm; target detection is itself one of the difficulties of multi-target tracking. Another challenging problem in multi-target tracking is occlusion. A target can be occluded by other objects or leave the current field of view, and frequent occlusion easily causes the target to be lost, degrading tracking accuracy.
Disclosure of Invention
In order to overcome the defects of the prior art, the disclosure provides a multi-view multi-target tracking method and system based on a distributed camera network.
In a first aspect, the present disclosure provides a multi-view multi-target tracking method based on a distributed camera network;
the multi-view multi-target tracking method based on the distributed camera network comprises the following steps:
acquiring a current frame view acquired by each camera in the distributed camera network;
extracting a rectangular bounding box of a detected target from the current frame view;
extracting visual appearance information of the detected target from the image inside the rectangular bounding box by using a pre-trained convolutional neural network; converting the image coordinates of the detected target in the current frame view into ground coordinates;
an output step: constructing a data association matrix based on the visual appearance information and the ground coordinates of the detected target; and processing the data association matrix by adopting a Hungarian algorithm, and outputting the result of successful matching or failed matching of the detected target and the known track in the current frame view.
In a second aspect, the present disclosure further provides a distributed camera network-based multi-view multi-target tracking system;
a multi-view multi-target tracking system based on a distributed camera network comprises:
an acquisition module configured to: acquiring a current frame view acquired by each camera in the distributed camera network;
a pre-processing module configured to: extracting a rectangular bounding box of a detected target from the current frame view;
an extraction module configured to: extracting visual appearance information of the detected target from the image inside the rectangular bounding box by using a pre-trained convolutional neural network; converting the image coordinates of the detected target in the current frame view into ground coordinates;
an output module configured to: constructing a data association matrix based on the visual appearance information and the ground coordinates of the detected target; and processing the data association matrix by adopting a Hungarian algorithm, and outputting the result of successful matching or failed matching of the detected target and the known track in the current frame view.
In a third aspect, the present disclosure also provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method adopts a data incidence matrix generated by combining visual appearance information and ground coordinates, and adopts the data incidence matrix to realize the matching of the detected target and the known track, so that the matching accuracy can be improved;
compared with the original DeepSORT method, the disclosed method is more robust in handling the target re-identification and occlusion problems.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
Fig. 1 shows the overall structure of the distributed multi-view multi-target tracking system of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiment I provides a multi-view multi-target tracking method based on a distributed camera network;
the multi-view multi-target tracking method based on the distributed camera network comprises the following steps:
S1: acquiring a current frame view acquired by each camera in the distributed camera network;
S2: extracting a rectangular bounding box of a detected target from the current frame view;
S3: extracting visual appearance information of the detected target from the image inside the rectangular bounding box by using a pre-trained convolutional neural network; converting the image coordinates of the detected target in the current frame view into ground coordinates;
S4: an output step: constructing a data association matrix based on the visual appearance information and the ground coordinates of the detected target; and processing the data association matrix by adopting the Hungarian algorithm, and outputting the result of successful matching or failed matching of the detected target and the known track in the current frame view.
As one or more embodiments, in S4, the specific steps of the outputting step include:
calculating the Mahalanobis distance between the ground coordinates of the detected target and the endpoint coordinates of each stored track in the current frame view;
calculating M cosine distances between the visual appearance information of the detected target and the visual appearance information stored for each track over the M frames preceding the current frame view, and taking the minimum of the M cosine distances as the final cosine distance;
when both the Mahalanobis distance and the final cosine distance are smaller than their respective set thresholds, performing a weighted summation of the two distances to obtain the data association matrix;
and inputting the data association matrix into the Hungarian algorithm, which outputs, for each detected target, a successful or failed match against the known tracks in the current frame view.
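A minimal sketch of this association step in Python (assuming NumPy and SciPy are available; the function name, gating thresholds, and the weight `lam` are illustrative choices rather than values fixed by the disclosure, and the appearance features are assumed L2-normalized):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_ends, track_covs, track_feats, det_pos, det_feats,
              lam=0.5, gate_pos=9.49, gate_app=0.3):
    """Build the gated, weighted association matrix and solve it with the
    Hungarian algorithm (linear_sum_assignment)."""
    BIG = 1e6                                    # prohibitive cost for gated-out pairs
    cost = np.full((len(track_ends), len(det_pos)), BIG)
    for i, (end, cov, feats) in enumerate(zip(track_ends, track_covs, track_feats)):
        inv_cov = np.linalg.inv(cov)
        for j, (p, f) in enumerate(zip(det_pos, det_feats)):
            diff = p - end
            d1 = float(diff @ inv_cov @ diff)            # squared Mahalanobis distance on the ground plane
            d2 = min(1.0 - g @ f for g in feats)         # min cosine distance over the last M frames
            if d1 < gate_pos and d2 < gate_app:          # both gates must pass
                cost[i, j] = lam * d1 + (1.0 - lam) * d2 # weighted sum of the two distances
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < BIG]
```

Track/detection pairs returned here are the successful matches; tracks and detections absent from the list are the match failures handled in S5 and S6 below.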
As one or more embodiments, the method further comprises:
S5: if image coordinate information of a successfully matched detected target and its corresponding track number ID exist in the current camera, the successfully matched information in the current camera and the successfully associated information in neighboring cameras are exchanged over repeated iterations, and the average consensus is computed to obtain a converged information vector and a converged information matrix;
posterior pose information is then calculated based on the converged information vector and the converged information matrix, thereby realizing multi-view multi-target tracking; finally, the position information of the detected target in the next frame view is predicted.
As one or more embodiments, the method further comprises:
S6: if ground coordinates of a detected target that failed to match a corresponding track number ID exist in the current camera, Euclidean distances are calculated between the coordinate information of the unmatched detected target in the current camera and the endpoint coordinates of each stored track in the views captured by the remaining cameras;
and if a Euclidean distance is smaller than the set threshold, the ground coordinates of the detected target in the current camera are matched to the corresponding track in the view captured by the other camera.
As one or more embodiments, a distributed camera network refers to the following: distributed as opposed to centralized. In a centralized system, data is gathered at a central processor for processing; in a distributed system, data is processed separately on individual processors, which may communicate with each other.
As one or more embodiments, in S2, the rectangular bounding box of the detected object is extracted from the current frame view by using the YOLOv3 network.
As one or more embodiments, in S3, the training step of the pre-trained convolutional neural network includes:
constructing a convolutional neural network; constructing a training set; the training set is an image with known visual appearance information;
inputting the training set into a convolutional neural network, and training the convolutional neural network;
and obtaining the trained convolutional neural network.
For example, the training set is a large-scale pedestrian re-identification dataset containing over 1,100,000 images of 1,261 pedestrians.
As one or more embodiments, in S3, the visual appearance information specifically refers to the 128-dimensional normalized feature vector output by the convolutional neural network; for example, it may be a profile feature.
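The disclosure does not fix a network architecture, so the following PyTorch stand-in only illustrates the interface: a CNN backbone with a 128-dimensional projection head whose output is L2-normalized, applied to a bounding-box crop:

```python
import torch
import torch.nn as nn
from torchvision import models

class AppearanceNet(nn.Module):
    """Stand-in appearance encoder: CNN backbone + 128-d projection + L2 norm."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = models.resnet18(weights=None)     # placeholder backbone, untrained here
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, dim)

    def forward(self, x):                                 # x: (B, 3, H, W) bounding-box crops
        return nn.functional.normalize(self.backbone(x), dim=1)  # unit-norm 128-d features

net = AppearanceNet().eval()
with torch.no_grad():
    feat = net(torch.rand(1, 3, 128, 64))                 # a typical re-ID crop size
```

Because the features are unit-norm, the cosine distance between two of them is simply 1 minus their dot product, which is how the association step above consumes them.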
As one or more embodiments, in S3, the image coordinates of the detected target in the current frame view are converted into ground coordinates through the following specific steps:
the pixel coordinates of the midpoint of the bottom edge of a person's bounding box in the image are used as the person's position information, and this position information is converted into ground coordinates through a homography matrix obtained from camera calibration.
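A minimal sketch of this conversion (assuming OpenCV; `H` is the 3x3 homography obtained from camera calibration, and the function name is illustrative):

```python
import numpy as np
import cv2

def bbox_to_ground(bbox, H):
    """Map the bottom-edge midpoint of a pixel bounding box (x, y, w, h)
    to ground-plane coordinates via the calibrated homography H."""
    x, y, w, h = bbox
    foot = np.array([[[x + w / 2.0, y + h]]], dtype=np.float32)  # bottom-centre pixel
    return cv2.perspectiveTransform(foot, H)[0, 0]               # (X, Y) on the ground plane
```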
As one or more embodiments, in S5, the specific steps of calculating the average consensus are as follows:
based on the ICF algorithm, neighboring cameras exchange information over repeated iterations to obtain a converged information vector and a converged information matrix:

$$v_i^{k+1} = v_i^k + \epsilon \sum_{j \in \mathcal{N}_i} \left(v_j^k - v_i^k\right), \qquad V_i^{k+1} = V_i^k + \epsilon \sum_{j \in \mathcal{N}_i} \left(V_j^k - V_i^k\right)$$

where $\epsilon$ is a constant rate parameter, $v_i$ denotes the information vector, and $V_i$ denotes the information matrix.
As one or more embodiments, in S5, the posterior pose information is calculated as:

$$x_i^{+} = \left(V_i^{K}\right)^{-1} v_i^{K}, \qquad W_i^{+} = N\, V_i^{K}$$

where $N$ is the number of cameras and $K$ is the number of consensus iterations.
As one or more embodiments, in S5, the position information of the detected target in the next frame view is predicted by:

$$x_{i,t+1}^{-} = \Phi\, x_{i,t}^{+}, \qquad W_{i,t+1}^{-} = \left(\Phi \left(W_{i,t}^{+}\right)^{-1} \Phi^{\top} + Q\right)^{-1}$$

where $i$ refers to the $i$-th node, $t$ is the $t$-th frame, $\Phi$ is the linear state transition matrix, and $Q$ is the process noise covariance.
The overall structure of the disclosed distributed multi-target tracking method is shown in Fig. 1. First, YOLOv3 detects the targets seen by each camera; this algorithm can extract the rectangular bounding boxes of detected targets at a high frame rate. Then, the visual appearance information of each target is obtained by a pre-trained convolutional neural network. The Hungarian algorithm combines the visual appearance information and the position information of the targets to perform data association. Finally, an information-weighted consensus filter (ICF) fuses the position information of each target from the multiple views. The specific steps are as follows:
1. target detection
Object detection refers to locating the different objects in an image and determining their type and position. Deep-learning-based target detection is robust to illumination changes, occlusion, and complex environments, and has two main research directions: two-stage methods and one-stage methods. A two-stage method first predicts a number of candidate boxes that may contain targets, then refines the boxes and classifies them to obtain the precise location, size, and category of each target; Faster R-CNN is an example. A one-stage method omits the first step and directly predicts the position and category of each target; YOLOv3 is an example. Compared with two-stage methods, one-stage methods are generally faster with comparable performance. We therefore chose YOLOv3 as the target detector.
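For illustration, a standard Darknet YOLOv3 model can be run through OpenCV's DNN module roughly as follows (a sketch assuming stock `yolov3.cfg`/`yolov3.weights` files; the paths and thresholds are placeholders, and COCO class 0 is taken as "person"):

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_layers = net.getUnconnectedOutLayersNames()

def detect_people(frame, conf_thr=0.5, nms_thr=0.4):
    """Return person bounding boxes [x, y, w, h] for one frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for out in net.forward(out_layers):
        for det in out:                                  # det = [cx, cy, bw, bh, obj, class scores...]
            cls_scores = det[5:]
            if np.argmax(cls_scores) == 0 and cls_scores[0] > conf_thr:
                cx, cy, bw, bh = det[:4] * [w, h, w, h]  # scale back to pixel units
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                scores.append(float(cls_scores[0]))
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)   # non-maximum suppression
    return [boxes[i] for i in np.asarray(keep).flatten()]
```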
2. Data association
A simple Hungarian algorithm is used for data association. The visual appearance information of a target is the 128-dimensional feature vector produced by the trained convolutional neural network, and the position information of the target is obtained by converting its image coordinates to the ground plane using the calibrated homography matrix. The final association matrix weights the two together:

$$c_{i,j} = \lambda\, d^{(1)}(i,j) + (1-\lambda)\, d^{(2)}(i,j) \qquad (1)$$

where $i$ and $j$ index the $i$-th track and the $j$-th measurement respectively, $\lambda$ is an adjustable weighting parameter, $d^{(1)}$ is the Mahalanobis distance between the measured position and the last frame of each stored track in the current view, and $d^{(2)}$ is the minimum cosine distance between the measured appearance information and the appearance information stored in each track.
It should be understood that the homography matrix refers to a homography matrix obtained through camera calibration, and the homography matrix can convert pixel coordinates into ground coordinates.
Furthermore, a threshold (gating) function is used to ignore unlikely candidates:

$$b_{i,j} = \prod_{k=1}^{2} \mathbf{1}\!\left[\, d^{(k)}(i,j) \le t^{(k)} \,\right]$$

where $k$ equals 1 or 2, indexing the position and appearance terms, and $t^{(k)}$ are the corresponding thresholds. The association between the $i$-th track and the $j$-th measurement is only allowed if both the Mahalanobis and cosine distances are below their thresholds, i.e. $b_{i,j} = 1$.
3. Track processing using multi-view information
The track processing step performs ID management, including recovering old tracks and creating new tracks. Recovering an old track means that after a person has been out of the field of view for 30 frames, that person's track is deleted, and when the person returns the original ID is assigned again. Creating a new track means that a new person entering the field of view is assigned a new ID. When a detection in the current view fails to match, its position is compared, using the Euclidean distance, with the position of the last frame of each deleted track; if a matching candidate is found, the old track is recovered and its ID is given to the detection.
If this match also fails, the algorithm checks whether a corresponding match exists in the other views, i.e. a detection in another view that likewise failed to match; if their distance satisfies the threshold requirement, a new ID is initialized for both, generating a new track. In addition, the algorithm removes any track that has been absent from the current view for more than 30 s, to reduce interference.
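The ID bookkeeping just described might look roughly as follows (an illustrative sketch; `TrackStub` and all names are ours, and distances are measured on the ground plane):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackStub:
    id: int
    last_pos: np.ndarray      # last known ground-plane position

def resolve_unmatched(det_pos, lost_tracks, other_view_unmatched, new_id, dist_thr=1.0):
    """Revive a deleted track whose endpoint lies near the unmatched detection;
    otherwise pair with an unmatched detection from another view under a new ID."""
    for tr in lost_tracks:                                    # 1) recover the old ID
        if np.linalg.norm(det_pos - tr.last_pos) < dist_thr:
            return tr.id
    for other in other_view_unmatched:                        # 2) cross-view new track
        if np.linalg.norm(det_pos - other.last_pos) < dist_thr:
            other.id = new_id                                 # both views share the fresh ID
            return new_id
    return None                                               # leave unassigned for now
```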
4. Information weighted consensus filter for multi-view information fusion
An information-weighted consensus filter (ICF) is an efficient distributed state estimation method. Here, ICF is used to fuse the multi-view information and estimate each target's position. ICF consists of three main steps: state prediction, measurement update, and weighted consensus. For state prediction, a linear constant-velocity model predicts the target's next state variable $x$ and information matrix $W$:

$$x_{i,t+1}^{-} = \Phi\, x_{i,t}^{+}, \qquad W_{i,t+1}^{-} = \left(\Phi \left(W_{i,t}^{+}\right)^{-1} \Phi^{\top} + Q\right)^{-1}$$

where $i$ refers to the $i$-th node, $t$ is the $t$-th frame, $\Phi$ is the linear state transition matrix, and $Q$ is the process noise covariance. The predicted position information is sent to the data association module to be matched with the measurements. In the measurement update, the current measurement $z_i$ is used to compute an information vector $v_i$ and an information matrix $V_i$:

$$v_i = \frac{1}{N} W_i^{-} x_i^{-} + H_i^{\top} R_i^{-1} z_i, \qquad V_i = \frac{1}{N} W_i^{-} + H_i^{\top} R_i^{-1} H_i$$

where $x_i^{-}$, $W_i^{-}$, $H_i$, $R_i$, and $N$ are the prior state vector, the prior information matrix, the observation matrix, the measurement noise covariance, and the number of cameras, respectively. For weighted consensus, each camera sends its information vector $v_i$ and information matrix $V_i$ to its neighboring cameras and receives theirs, iterating $K$ times until the filter converges:

$$v_i^{k+1} = v_i^k + \epsilon \sum_{j \in \mathcal{N}_i} \left(v_j^k - v_i^k\right), \qquad V_i^{k+1} = V_i^k + \epsilon \sum_{j \in \mathcal{N}_i} \left(V_j^k - V_i^k\right)$$

where $\epsilon$ is a constant rate parameter. After convergence, the posterior state estimate and information matrix are

$$x_i^{+} = \left(V_i^{K}\right)^{-1} v_i^{K}, \qquad W_i^{+} = N\, V_i^{K}$$

where $N$ is the number of cameras.
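Putting the three ICF steps together, a per-node sketch (assuming NumPy; `neighbor_msgs` stands in for the network exchange of (v, V) pairs with adjacent cameras, and all names and defaults are illustrative):

```python
import numpy as np

def icf_update(x_prior, W_prior, z, H, R, neighbor_msgs, N, eps=0.2, K=10):
    """Measurement update plus K weighted-consensus iterations at one camera.
    neighbor_msgs(v, V) sends our pair out and returns the neighbours' pairs."""
    Rinv = np.linalg.inv(R)
    v = W_prior @ x_prior / N + H.T @ Rinv @ z     # local information vector
    V = W_prior / N + H.T @ Rinv @ H               # local information matrix
    for _ in range(K):                             # consensus iterations with rate eps
        nbrs = neighbor_msgs(v, V)
        v = v + eps * sum(vj - v for vj, _ in nbrs)
        V = V + eps * sum(Vj - V for _, Vj in nbrs)
    x_post = np.linalg.solve(V, v)                 # posterior state estimate
    W_post = N * V                                 # posterior information matrix
    return x_post, W_post

def icf_predict(x_post, W_post, Phi, Q):
    """Constant-velocity prediction of the state and its information matrix."""
    x_pred = Phi @ x_post
    W_pred = np.linalg.inv(Phi @ np.linalg.inv(W_post) @ Phi.T + Q)
    return x_pred, W_pred
```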
The second embodiment also provides a multi-view multi-target tracking system based on the distributed camera network;
a multi-view multi-target tracking system based on a distributed camera network comprises:
an acquisition module configured to: acquiring a current frame view acquired by each camera in the distributed camera network;
a pre-processing module configured to: extracting a rectangular bounding box of a detected target from the current frame view;
an extraction module configured to: extracting visual appearance information of the detected target from the image inside the rectangular bounding box by using a pre-trained convolutional neural network; converting the image coordinates of the detected target in the current frame view into ground coordinates;
an output module configured to: constructing a data association matrix based on the visual appearance information and the ground coordinates of the detected target; and processing the data association matrix by adopting a Hungarian algorithm, and outputting the result of successful matching or failed matching of the detected target and the known track in the current frame view.
In a third embodiment, the present embodiment further provides an electronic device, which includes a memory, a processor, and computer instructions stored in the memory and executable on the processor, where the computer instructions, when executed by the processor, implement the steps of the method of the first aspect.
In a fourth embodiment, the present embodiment further provides a computer-readable storage medium for storing computer instructions, and the computer instructions, when executed by a processor, perform the steps of the method according to the first aspect.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A multi-view multi-target tracking method based on a distributed camera network, characterized by comprising the following steps:
acquiring a current frame view acquired by each camera in the distributed camera network;
extracting a rectangular bounding box of a detected target from the current frame view;
extracting visual appearance information of the detected target from the image inside the rectangular bounding box by using a pre-trained convolutional neural network; converting the image coordinates of the detected target in the current frame view into ground coordinates;
an output step: constructing a data association matrix based on the visual appearance information and the ground coordinates of the detected target; and processing the data association matrix by adopting a Hungarian algorithm, and outputting the result of successful matching or failed matching of the detected target and the known track in the current frame view.
2. The method of claim 1, wherein said outputting step comprises the specific steps of:
calculating the Mahalanobis distance between the ground coordinates of the detected target and the endpoint coordinates of each stored track in the current frame view;
calculating M cosine distances between the visual appearance information of the detected target and the visual appearance information stored for each track over the M frames preceding the current frame view, and taking the minimum of the M cosine distances as the final cosine distance;
when both the Mahalanobis distance and the final cosine distance are smaller than their respective set thresholds, performing a weighted summation of the two distances to obtain the data association matrix;
and inputting the data association matrix into the Hungarian algorithm, which outputs, for each detected target, a successful or failed match against the known tracks in the current frame view.
3. The method of claim 1, further comprising:
if image coordinate information of a successfully matched detected target and its corresponding track number ID exist in the current camera, repeatedly and iteratively exchanging the successfully matched information in the current camera with the successfully associated information in neighboring cameras, and computing the average consensus to obtain a converged information vector and a converged information matrix;
calculating posterior pose information based on the converged information vector and the converged information matrix, thereby realizing multi-view multi-target tracking; and then predicting the position information of the detected target in the next frame view.
4. The method of claim 3, further comprising:
if ground coordinates of a detected target that failed to match a corresponding track number ID exist in the current camera, calculating Euclidean distances between the coordinate information of the unmatched detected target in the current camera and the endpoint coordinates of each stored track in the views captured by the remaining cameras;
and if a Euclidean distance is smaller than the set threshold, matching the ground coordinates of the detected target in the current camera to the corresponding track in the view captured by the other camera.
5. The method of claim 3, wherein the rectangular bounding box of the detected target is extracted from the current frame view by using a YOLOv3 network.
6. The method of claim 3, wherein the training of the pre-trained convolutional neural network comprises the following steps:
constructing a convolutional neural network; constructing a training set; the training set is an image with known visual appearance information;
inputting the training set into a convolutional neural network, and training the convolutional neural network;
and obtaining the trained convolutional neural network.
7. The method of claim 1, wherein the image coordinates of the detected target in the current frame view are converted into ground coordinates through the following specific steps:
the pixel coordinates of the midpoint of the bottom edge of a person's bounding box in the image are used as the person's position information, and this position information is converted into ground coordinates through a homography matrix obtained from camera calibration.
8. A multi-view multi-target tracking system based on a distributed camera network, characterized by comprising:
an acquisition module configured to: acquiring a current frame view acquired by each camera in the distributed camera network;
a pre-processing module configured to: extracting a rectangular bounding box of a detected target from the current frame view;
an extraction module configured to: extracting visual appearance information of the detected target from the image inside the rectangular bounding box by using a pre-trained convolutional neural network; converting the image coordinates of the detected target in the current frame view into ground coordinates;
an output module configured to: constructing a data association matrix based on the visual appearance information and the ground coordinates of the detected target; and processing the data association matrix by adopting a Hungarian algorithm, and outputting the result of successful matching or failed matching of the detected target and the known track in the current frame view.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911012807.6A CN110782483B (en) | 2019-10-23 | 2019-10-23 | Multi-view multi-target tracking method and system based on distributed camera network |
LU102028A LU102028B1 (en) | 2019-10-23 | 2020-09-03 | Multiple view multiple target tracking method and system based on distributed camera network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911012807.6A CN110782483B (en) | 2019-10-23 | 2019-10-23 | Multi-view multi-target tracking method and system based on distributed camera network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110782483A true CN110782483A (en) | 2020-02-11 |
CN110782483B CN110782483B (en) | 2022-03-15 |
Family
ID=69386547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911012807.6A Active CN110782483B (en) | 2019-10-23 | 2019-10-23 | Multi-view multi-target tracking method and system based on distributed camera network |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110782483B (en) |
LU (1) | LU102028B1 (en) |
- 2019-10-23: CN application CN201911012807.6A, granted as CN110782483B (Active)
- 2020-09-03: LU application LU102028A, granted as LU102028B1 (IP Right Grant)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170286774A1 (en) * | 2016-04-04 | 2017-10-05 | Xerox Corporation | Deep data association for online multi-class multi-object tracking |
CN107292911A (en) * | 2017-05-23 | 2017-10-24 | 南京邮电大学 | A kind of multi-object tracking method merged based on multi-model with data correlation |
CN109447121A (en) * | 2018-09-27 | 2019-03-08 | 清华大学 | A kind of Visual Sensor Networks multi-object tracking method, apparatus and system |
CN109816690A (en) * | 2018-12-25 | 2019-05-28 | 北京飞搜科技有限公司 | Multi-target tracking method and system based on depth characteristic |
CN109934844A (en) * | 2019-01-28 | 2019-06-25 | 中国人民解放军战略支援部队信息工程大学 | A kind of multi-object tracking method and system merging geospatial information |
CN109919981A (en) * | 2019-03-11 | 2019-06-21 | 南京邮电大学 | A kind of multi-object tracking method of the multiple features fusion based on Kalman filtering auxiliary |
Non-Patent Citations (2)
Title |
---|
GUOLIANG LIU et al.: "Human Action Recognition Using a Distributed RGB-Depth Camera Network", IEEE *
ZHAO Guanghui et al.: "Multi-target tracking method based on Kalman filtering", Computer Science *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111738075A (en) * | 2020-05-18 | 2020-10-02 | 深圳奥比中光科技有限公司 | Joint point tracking method and system based on pedestrian detection |
CN111626194A (en) * | 2020-05-26 | 2020-09-04 | 佛山市南海区广工大数控装备协同创新研究院 | Pedestrian multi-target tracking method using depth correlation measurement |
CN111626194B (en) * | 2020-05-26 | 2024-02-02 | 佛山市南海区广工大数控装备协同创新研究院 | Pedestrian multi-target tracking method using depth correlation measurement |
CN112215873A (en) * | 2020-08-27 | 2021-01-12 | 国网浙江省电力有限公司电力科学研究院 | Method for tracking and positioning multiple targets in transformer substation |
CN112070807B (en) * | 2020-11-11 | 2021-02-05 | 湖北亿咖通科技有限公司 | Multi-target tracking method and electronic device |
CN112070807A (en) * | 2020-11-11 | 2020-12-11 | 湖北亿咖通科技有限公司 | Multi-target tracking method and electronic device |
CN112633205A (en) * | 2020-12-28 | 2021-04-09 | 北京眼神智能科技有限公司 | Pedestrian tracking method and device based on head and shoulder detection, electronic equipment and storage medium |
CN113674317B (en) * | 2021-08-10 | 2024-04-26 | 深圳市捷顺科技实业股份有限公司 | Vehicle tracking method and device for high-level video |
CN113674317A (en) * | 2021-08-10 | 2021-11-19 | 深圳市捷顺科技实业股份有限公司 | Vehicle tracking method and device of high-order video |
CN114089675A (en) * | 2021-11-23 | 2022-02-25 | 长春工业大学 | Machine control method and control system based on man-machine distance |
CN114089675B (en) * | 2021-11-23 | 2023-06-09 | 长春工业大学 | Machine control method and system based on man-machine distance |
CN114299128A (en) * | 2021-12-30 | 2022-04-08 | 咪咕视讯科技有限公司 | Multi-view positioning detection method and device |
CN114596337A (en) * | 2022-03-03 | 2022-06-07 | 捻果科技(深圳)有限公司 | Self-recognition target tracking method and system based on linkage of multiple camera positions |
CN116758119A (en) * | 2023-06-27 | 2023-09-15 | 重庆比特数图科技有限公司 | Multi-target circulation detection tracking method and system based on motion compensation and linkage |
CN116758119B (en) * | 2023-06-27 | 2024-04-19 | 重庆比特数图科技有限公司 | Multi-target circulation detection tracking method and system based on motion compensation and linkage |
CN117726656A (en) * | 2024-02-08 | 2024-03-19 | 开拓导航控制技术股份有限公司 | Target tracking method, device, system and medium based on super-resolution image |
CN117726656B (en) * | 2024-02-08 | 2024-06-04 | 开拓导航控制技术股份有限公司 | Target tracking method, device, system and medium based on super-resolution image |
CN117853759A (en) * | 2024-03-08 | 2024-04-09 | 山东海润数聚科技有限公司 | Multi-target tracking method, system, equipment and storage medium |
CN117853759B (en) * | 2024-03-08 | 2024-05-10 | 山东海润数聚科技有限公司 | Multi-target tracking method, system, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
LU102028B1 (en) | 2021-03-03 |
CN110782483B (en) | 2022-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110782483B (en) | Multi-view multi-target tracking method and system based on distributed camera network | |
CN108470332B (en) | Multi-target tracking method and device | |
Kendall et al. | Modelling uncertainty in deep learning for camera relocalization | |
Zhang et al. | Motion analysis | |
Boniardi et al. | Robot localization in floor plans using a room layout edge extraction network | |
JP4985516B2 (en) | Information processing apparatus, information processing method, and computer program | |
Jellal et al. | LS-ELAS: Line segment based efficient large scale stereo matching | |
CN112184757A (en) | Method and device for determining motion trail, storage medium and electronic device | |
Bashar et al. | Multiple object tracking in recent times: A literature review | |
Chang et al. | Fast Random‐Forest‐Based Human Pose Estimation Using a Multi‐scale and Cascade Approach | |
Klinger et al. | Probabilistic multi-person localisation and tracking in image sequences | |
CN109636828A (en) | Object tracking methods and device based on video image | |
CN107194950B (en) | Multi-person tracking method based on slow feature analysis | |
Urdiales et al. | An improved deep learning architecture for multi-object tracking systems | |
Wang et al. | Effective multiple pedestrian tracking system in video surveillance with monocular stationary camera | |
CN115035158A (en) | Target tracking method and device, electronic equipment and storage medium | |
CN107665495B (en) | Object tracking method and object tracking device | |
Nguyen et al. | 3d pedestrian tracking using local structure constraints | |
Ginhoux et al. | Model-based object tracking using stereo vision | |
JP4879257B2 (en) | Moving object tracking device, moving object tracking method, and moving object tracking program | |
ZHOU et al. | Background subtraction based on co-occurrence pixel-block pairs for robust object detection in dynamic scenes | |
CN108346158B (en) | Multi-target tracking method and system based on main block data association | |
Franchi et al. | Tracking hundreds of people in densely crowded scenes with particle filtering supervising deep convolutional neural networks | |
Palma-Amestoy et al. | Bayesian spatiotemporal context integration sources in robot vision systems | |
KR102589987B1 (en) | Method and Apparatus for Tracking of Online Multi-Object with Visual and Radar Features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |