CN118115755B - Multi-target tracking method, system and storage medium - Google Patents
- Publication number
- CN118115755B (application CN202410517072.7A)
- Authority
- CN
- China
- Prior art keywords
- target
- lost
- frame
- target frame
- newly
- Prior art date
- Legal status
- Active
Classifications
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/464—Salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
- G06V10/50—Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
- G06V10/757—Matching configurations of points or features
- G06V10/761—Proximity, similarity or dissimilarity measures
- G06V10/764—Image or video recognition or understanding using classification, e.g. of video objects
- G06V10/82—Image or video recognition or understanding using neural networks
- G06T2207/10016—Video; Image sequence
- G06V2201/07—Target detection
Abstract
The invention discloses a multi-target tracking method, system and storage medium, belonging to the technical field of computer vision tracking. The method comprises the following steps: S1: an image acquisition stage; S2: a target detection and camera motion estimation stage; S3: a target motion compensation stage, in which motion compensation is performed on the first target frame coordinates pH according to the target frame set and the camera motion matrix M to obtain the second target frame coordinates pC, which are then assigned back to the target frame; S4: an adjacent-image-frame target association stage, in which a multi-target association tracking method is used to associate the target frames of adjacent image frames; S5: a multi-frame association recovery stage, in which a lost target set and a newly added target set are maintained and a loss recovery strategy is used to obtain the multi-target tracking result. Multi-frame association recovery solves the problem that existing adjacent-frame association methods cannot retrieve lost targets; it also reduces the feature matching space of multi-frame association recovery, so that targets lost for either a short or a long time can be retrieved quickly.
Description
Technical Field
The present invention relates to the field of computer vision tracking technologies, and in particular, to a multi-target tracking method, system, and storage medium.
Background
Multi-target tracking technology detects and tracks multiple targets in a video while keeping each target identity unique, and is widely applied in fields such as surveillance and autonomous driving. In recent years, target detection based on deep learning has developed rapidly, and the paradigm of first performing target detection and then performing multi-target association has become the mainstream multi-target tracking scheme. Target detection is mainly responsible for detecting information such as the position, size and confidence of targets in an image. Detectors such as the YOLO series, CenterNet and DETR achieve high detection accuracy in real time, which has promoted research on downstream tasks such as target tracking.
Multi-target association is mainly responsible for the data association of tracked targets across adjacent frames, and there are two main data association methods. The first is the position-and-motion-model method: under a uniform-motion assumption it predicts state information such as the position and speed of each track in the current frame with a Kalman filter, computes the IOU similarity between detections and predictions, and completes the association of tracks and detections with a Hungarian matching or greedy matching strategy. The second is the appearance-model method, which mainly solves the re-matching of targets that disappear due to long-term occlusion; it introduces deep-model features of the target and completes the association step by combining target feature similarity with a Hungarian matching or greedy matching strategy. Representative algorithms include the SORT, DeepSORT, ByteTrack and BoT-SORT multi-target tracking algorithms.
However, the complexity of multi-target tracking scenarios poses significant challenges for both target detection and multi-target association. Under occlusion and camera motion, target matching is often inaccurate. Methods based on the position and motion model only consider target association between adjacent frames, so tracking is lost when a target is occluded or the camera motion is too large. Methods based on the appearance model can re-match targets that have disappeared for a long time, but their feature extraction is slow and their re-matching space is large, so the deep-learning model cannot run in real time.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a multi-target tracking method, a multi-target tracking system and a storage medium.
The aim of the invention is realized by the following technical scheme: the first aspect of the present invention provides: a multi-target tracking method comprising the steps of:
S1: an image acquisition stage of acquiring image data from the optoelectronic device;
s2: in the target detection and camera motion estimation stage, a multi-target detection model is used for carrying out target detection on image data to obtain a target frame set; meanwhile, calculating a camera motion matrix M according to static region feature points of adjacent image frames in the image data;
s3: a target motion compensation stage, namely performing motion compensation on the first target frame coordinate pH according to the target frame set and the camera motion matrix M to obtain a second target frame coordinate pC, and then assigning the second target frame coordinate pC back to the target frame;
s4: a target association stage of adjacent image frames, wherein a multi-target association tracking method is used for associating target frames of the adjacent image frames;
S5: setting a lost target set and a newly added target set in a multi-frame association recovery stage, and obtaining a multi-target tracking result by using a lost recovery strategy;
s6: and a result output stage for outputting the multi-target tracking result by using the result output device.
Preferably, the multi-target detection model is a Faster R-CNN, YOLO or DETR model; the target frame set comprises a plurality of target frames; each target frame comprises target frame coordinates, a target frame category, a target frame confidence and a target frame tracking ID, wherein the target frame coordinates comprise the upper-left abscissa, upper-left ordinate, lower-right abscissa and lower-right ordinate of the target frame; and the initial value of the target frame tracking ID is -1.
Preferably, the camera motion matrix M is calculated by the following steps:
Static-region feature points are extracted, filtering out target frames whose preset category is dynamic; then the static-region feature points of adjacent image frames are matched and associated with a feature point matching method; then noise matching pairs from dynamic regions are removed with the RANSAC algorithm to obtain a static-region matching pair set; finally, the static-region matching pair set is used as the input of the OpenCV getAffineTransform() function to obtain the camera motion matrix M.
Preferably, the static-region feature points are ORB, SURF, SIFT or SuperPoint features; the feature point matching method is brute-force matching, approximate nearest-neighbor matching or SuperGlue matching; the model used by the RANSAC algorithm is a homography matrix H estimation model, an essential matrix E estimation model or a fundamental matrix F estimation model; and the camera motion matrix M is a matrix of 2 rows and 3 columns.
Preferably, the specific calculation formula of the motion compensation applies the 2x3 affine matrix M to both corners of the target frame (pC = M·pH in homogeneous coordinates):
xc1 = a11·x1 + a12·y1 + a13, yc1 = a21·x1 + a22·y1 + a23;
xc2 = a11·x2 + a12·y2 + a13, yc2 = a21·x2 + a22·y2 + a23;
wherein pH is the first target frame coordinates, M is the camera motion matrix comprising the six elements a11, a12, a13, a21, a22, a23, pC is the second target frame coordinates, x1 is the upper-left abscissa of the first target frame, y1 is the upper-left ordinate of the first target frame, x2 is the lower-right abscissa of the first target frame, y2 is the lower-right ordinate of the first target frame, xc1 is the upper-left abscissa of the second target frame, yc1 is the upper-left ordinate of the second target frame, xc2 is the lower-right abscissa of the second target frame, and yc2 is the lower-right ordinate of the second target frame.
Preferably, the adjacent image frames are the t-1 image frame and the t image frame; the multi-target association tracking method is the SORT, DeepSORT, ByteTrack or BoT-SORT algorithm;
after the target frames of adjacent image frames have been associated, a target frame tracking ID is assigned to each target frame in turn; for the initial image frame, tracking IDs are assigned in sequence from 0 according to the storage order of the target frames, and the lost target set and the newly added target set of the initial image frame are emptied;
for the t image frame, if a target frame of the t image frame has been associated with a target frame of the t-1 image frame, the target frame tracking ID of that t-1 image frame target frame is assigned to the t image frame target frame; if a target frame of the t image frame is not associated, its target frame tracking ID is assigned as the largest existing target frame tracking ID plus 1; meanwhile, the unassociated target frames of the t-1 image frame are added to the lost target set of the t image frame, and the unassociated target frames of the t image frame are added to the newly added target set of the t image frame.
Preferably, the loss recovery strategy comprises a short-time loss recovery strategy and a long-time loss recovery strategy; step S5, the multi-frame association recovery stage, further comprises: a lost target set management method, a newly added target set management method, a short-time loss judgment method, the short-time loss recovery strategy and the long-time loss recovery strategy;
The lost target set management method comprises a historical lost target coordinate updating stage, a lost target adding stage and a lost target deleting stage. First, in the historical lost target coordinate updating stage, the coordinates of each lost target frame in the lost target set are predicted and the prediction result is written back to the lost target frame; the prediction comprises the following steps: let the target frame coordinates of any lost target frame in the lost target set at time t be R(tx1, ty1, tx2, ty2), and model R as a function of (deltaT, vx1, vy1, alpha, W, H) as follows:
wherein deltaT is the time difference between the prediction moment and the moment the target was lost, vx1 is the historical moving speed of the upper-left abscissa of the lost target frame, vy1 is the historical moving speed of the lower-right ordinate of the lost target frame, alpha is the expansion coefficient of the rectangular frame with a value between 0 and 1, W is the width of the image frame in which the lost target frame is located, H is the height of that image frame, and the historical moving speed is the speed at the moment of loss, the average speed, the median speed or the maximum speed. Then, in the lost target adding stage, a lost feature vector is extracted with a feature extraction method from the historical target set of each lost target frame in the lost target set of the t image frame, and the lost feature vectors of the whole historical target set are stored as a historical feature library, where the historical target set is the set of target image regions cached before the lost target frame was lost. Next, the lost target frames of the t image frame, together with their historical feature libraries, are added to the lost target set. Finally, lost target frames whose time in the lost target set exceeds the maximum existence duration of lost targets are deleted;
The newly added target set management method comprises a historical newly added target coordinate updating stage, a newly added target adding stage and a newly added target deleting stage. First, in the historical newly added target coordinate updating stage, the coordinates of the newly added target frames in the newly added target set are updated: the latest target frame coordinates are looked up in the target frame set through the target frame tracking ID of each newly added target frame and written back. Then, in the newly added target adding stage, a newly added feature vector is extracted with the feature extraction method from each newly added target frame in the newly added target set of the t image frame. Next, the newly added target frames of the t image frame, together with their newly added feature vectors, are added to the newly added target set. Finally, newly added target frames whose time in the newly added target set exceeds the maximum existence duration of newly added targets are deleted;
The short-time loss judgment method comprises the following steps: each lost target frame in the lost target set is examined; all lost target frames whose loss duration is smaller than the loss duration threshold are marked as short-time lost target frames and stored in a short-time lost set; all lost target frames whose loss duration is greater than or equal to the loss duration threshold are marked as long-time lost target frames and stored in a long-time lost set;
The short-time loss recovery strategy comprises the following steps: first, judge whether the short-time lost set or the newly added target set is empty; if either is empty, no short-time lost target frame is recovered; if not, perform coordinate matching: predict the short-time lost target frame to obtain a target prediction frame, compare the coordinates of the target prediction frame with the coordinates of all newly added target frames in the newly added target set, and take the newly added target frames lying completely inside the target prediction frame as the candidate newly added target set; finally, perform feature matching: compute in turn the cosine similarity between the lost feature vector of each short-time lost target frame in the short-time lost set at the moment of loss and the newly added feature vector of each target frame in the candidate newly added target set, and recover the short-time lost target frame by the matching similarity, the formula being b = argmax_{bj∈Bc} (xa·xbj)/(‖xa‖‖xbj‖), wherein b is the target frame with the highest similarity in the candidate newly added target set, bj is a target frame in the candidate newly added target set, Bc is the candidate newly added target set, xa is the lost feature vector of the short-time lost target frame at the moment of loss, and xbj is the newly added feature vector of target frame bj; if the similarity of b is greater than or equal to the feature matching similarity threshold, the short-time lost target frame is retrieved: the target frame tracking ID of b is set to the target frame tracking ID of the short-time lost target frame, b is deleted from the newly added target set and the candidate newly added target set, and the short-time lost target frame is deleted from the lost target set; if the similarity of b is smaller than the feature matching similarity threshold, the short-time lost target frame is not retrieved;
The long-time loss recovery strategy comprises the following steps: first, judge whether the long-time lost set or the newly added target set is empty; if either is empty, no long-time lost target frame is retrieved; if not, perform feature library matching: using the history feature library of the long-time lost target frame al at the moment of loss, compute in turn the cosine similarity with the newly added feature vector of each target frame in the newly added target set, and retrieve the long-time lost target frame by the matching similarity, the formula being bl = argmax_{bk∈B} (xal·xbk)/(‖xal‖‖xbk‖), wherein bl is the target frame with the highest similarity in the newly added target set, bk is a target frame in the newly added target set, B is the newly added target set, al is a target frame in the long-time lost set, xal is a lost feature vector in the history feature library of the long-time lost target frame at the moment of loss, and xbk is the newly added feature vector of target frame bk; if the similarity of bl is greater than or equal to the feature matching similarity threshold, the long-time lost target frame is retrieved: the target frame tracking ID of bl is set to the target frame tracking ID of the long-time lost target frame, bl is deleted from the newly added target set, and the long-time lost target frame is deleted from the lost target set; if the similarity of bl is smaller than the feature matching similarity threshold, the long-time lost target frame is not retrieved.
Preferably, the feature extraction method is the HOG method, the bag-of-words (BoW) method, a VGG neural network model, a ResNet neural network model or a NetVLAD neural network model.
A second aspect of the invention provides: a multi-target tracking system for implementing any of the multi-target tracking methods described above, comprising:
The optoelectronic device is used for carrying an optoelectronic camera and transmitting the image data to the computing device in a wired or wireless transmission mode;
The computing device is used for carrying a computing force unit and a multi-target tracking module, wherein the multi-target tracking module is used for realizing multi-target tracking of the image data to obtain a multi-target tracking result and outputting the multi-target tracking result to the result output device;
The result output device is used for outputting a multi-target tracking result in a wired transmission mode or a wireless transmission mode;
the multi-target tracking module includes: an image acquisition unit for acquiring image data from the optoelectronic device;
the target detection and camera motion estimation unit is used for carrying out target detection on the image data by using the multi-target detection model to obtain a target frame set; meanwhile, calculating a camera motion matrix M according to static region feature points of adjacent image frames in the image data;
The target motion compensation unit is used for performing motion compensation on the first target frame coordinate pH according to the target frame set and the camera motion matrix M to obtain a second target frame coordinate pC, and then assigning the second target frame coordinate pC back to the target frame;
The adjacent image frame target association unit is used for associating target frames of the adjacent image frames by using a multi-target association tracking method;
And the multi-frame association recovery unit is used for setting a lost target set and a newly added target set, and obtaining a multi-target tracking result by using a lost recovery strategy.
A third aspect of the invention provides: a computer storage medium having stored therein computer executable instructions that when loaded and executed by a processor implement any of the multi-objective tracking methods described above.
The beneficial effects of the invention are as follows:
1) Multi-frame association retrieval solves the problem that existing adjacent-frame association methods cannot retrieve lost targets; meanwhile, the designed maintenance strategies for the lost target set and the newly added target set reduce the feature matching space of multi-frame association retrieval, so that targets lost for either a short or a long time can be retrieved quickly.
2) Target motion compensation corrects the erroneous offset of image targets caused by camera motion; at the same time, placing target motion compensation after target detection, rather than inside the adjacent-image-frame target association stage, means it can be combined with existing adjacent-frame target association methods without any modification.
3) Target detection and camera motion estimation are completely independent; computing them in parallel saves calculation time and leaves them unconstrained by each other.
Drawings
FIG. 1 is a flow chart of a multi-objective tracking method;
FIG. 2 is a block diagram of a multi-target tracking system;
FIG. 3 is a multi-frame association recovery flow chart.
Detailed Description
The technical solutions of the present invention will be clearly and completely described below with reference to the embodiments. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of the present invention.
In the present invention, MOT: Multi-Object Tracking; Kalman filtering: the Kalman filter algorithm, which can predict the position of an image target; RANSAC algorithm: Random Sample Consensus, an algorithm that estimates the parameters of a mathematical model from a set of sample data containing outliers so as to obtain valid sample data; OpenCV library: a library of general-purpose algorithms for image processing and computer vision.
Referring to fig. 1-3, a first aspect of the present invention provides: a multi-target tracking method comprising the steps of:
S1: an image acquisition stage of acquiring image data from the optoelectronic device;
s2: in the target detection and camera motion estimation stage, a multi-target detection model is used for carrying out target detection on image data to obtain a target frame set; meanwhile, calculating a camera motion matrix M according to static region feature points of adjacent image frames in the image data;
s3: a target motion compensation stage, namely performing motion compensation on the first target frame coordinate pH according to the target frame set and the camera motion matrix M to obtain a second target frame coordinate pC, and then assigning the second target frame coordinate pC back to the target frame;
s4: a target association stage of adjacent image frames, wherein a multi-target association tracking method is used for associating target frames of the adjacent image frames;
S5: setting a lost target set and a newly added target set in a multi-frame association recovery stage, and obtaining a multi-target tracking result by using a lost recovery strategy;
s6: and a result output stage for outputting the multi-target tracking result by using the result output device.
In some embodiments, the multi-target detection model is a Faster R-CNN, YOLO or DETR model; the target frame set comprises a plurality of target frames; each target frame comprises target frame coordinates, a target frame category, a target frame confidence and a target frame tracking ID, wherein the target frame coordinates comprise the upper-left abscissa, upper-left ordinate, lower-right abscissa and lower-right ordinate of the target frame; and the initial value of the target frame tracking ID is -1.
In some embodiments, the camera motion matrix M is calculated by:
Static-region feature points are extracted, filtering out target frames whose preset category is dynamic; then the static-region feature points of adjacent image frames are matched and associated with a feature point matching method; then noise matching pairs from dynamic regions are removed with the RANSAC algorithm to obtain a static-region matching pair set; finally, the static-region matching pair set is used as the input of the OpenCV getAffineTransform() function to obtain the camera motion matrix M.
In this embodiment, the preset category is set manually in advance, for example, a vehicle, a person and the like are dynamic categories, and a house, a bridge and the like are static categories; filtering the dynamic target box may reduce interference of the dynamic target.
In some embodiments, the static-region feature points are ORB, SURF, SIFT or SuperPoint features; the feature point matching method is brute-force matching, approximate nearest-neighbor matching or SuperGlue matching; the model used by the RANSAC algorithm is a homography matrix H estimation model, an essential matrix E estimation model or a fundamental matrix F estimation model; and the camera motion matrix M is a matrix of 2 rows and 3 columns.
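For illustration only (not part of the patent), the following Python/OpenCV sketch shows one way this step could be realised, assuming the ORB-feature, brute-force-matching and homography-model RANSAC options listed above; names such as camera_motion_matrix and dynamic_boxes are placeholders.

```python
import cv2
import numpy as np

IDENTITY_M = np.float32([[1, 0, 0], [0, 1, 0]])  # "no camera motion" fallback

def in_dynamic_box(pt, dynamic_boxes):
    """True if a keypoint lies inside any dynamic-category target frame (x1, y1, x2, y2)."""
    x, y = pt
    return any(x1 <= x <= x2 and y1 <= y <= y2 for x1, y1, x2, y2 in dynamic_boxes)

def camera_motion_matrix(prev_gray, cur_gray, dynamic_boxes):
    # Static-region feature points: ORB keypoints that do not fall inside dynamic target frames.
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(cur_gray, None)
    if des1 is None or des2 is None:
        return IDENTITY_M

    # Brute-force matching of the adjacent-frame descriptors.
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1, pts2 = [], []
    for m in matches:
        p1, p2 = kp1[m.queryIdx].pt, kp2[m.trainIdx].pt
        if not in_dynamic_box(p1, dynamic_boxes) and not in_dynamic_box(p2, dynamic_boxes):
            pts1.append(p1)
            pts2.append(p2)
    if len(pts1) < 4:
        return IDENTITY_M
    pts1, pts2 = np.float32(pts1), np.float32(pts2)

    # RANSAC (homography model) drops residual noise matches from moving regions.
    _, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    if mask is None:
        return IDENTITY_M
    inliers = mask.ravel().astype(bool)
    src, dst = pts1[inliers], pts2[inliers]
    if len(src) < 3:
        return IDENTITY_M

    # Three well-spread inlier pairs give the 2x3 affine camera motion matrix M,
    # mirroring the getAffineTransform() step described above.
    idx = np.linspace(0, len(src) - 1, 3).astype(int)
    return np.float32(cv2.getAffineTransform(src[idx], dst[idx]))
```

In practice cv2.estimateAffine2D(pts1, pts2, method=cv2.RANSAC) would combine the RANSAC filtering and the affine fit in one call; the split above simply follows the order of the steps in the text.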
In some embodiments, the motion compensation is specifically calculated by applying the 2x3 affine matrix M to both corners of the target frame (pC = M·pH in homogeneous coordinates):
xc1 = a11·x1 + a12·y1 + a13, yc1 = a21·x1 + a22·y1 + a23;
xc2 = a11·x2 + a12·y2 + a13, yc2 = a21·x2 + a22·y2 + a23;
wherein pH is the first target frame coordinates, M is the camera motion matrix comprising the six elements a11, a12, a13, a21, a22, a23, pC is the second target frame coordinates, x1 is the upper-left abscissa of the first target frame, y1 is the upper-left ordinate of the first target frame, x2 is the lower-right abscissa of the first target frame, y2 is the lower-right ordinate of the first target frame, xc1 is the upper-left abscissa of the second target frame, yc1 is the upper-left ordinate of the second target frame, xc2 is the lower-right abscissa of the second target frame, and yc2 is the lower-right ordinate of the second target frame.
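As a minimal illustrative sketch (not part of the patent), the compensation above can be applied to a whole detection set at once, assuming the target frames are stored as an N x 4 array of (x1, y1, x2, y2) rows; the function name compensate_boxes is an assumption.

```python
import numpy as np

def compensate_boxes(boxes, M):
    """Apply the 2x3 camera motion matrix M to the first target frame coordinates pH,
    returning the second target frame coordinates pC (both as N x 4 arrays)."""
    boxes = np.asarray(boxes, dtype=np.float32)
    a11, a12, a13 = M[0]
    a21, a22, a23 = M[1]
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    xc1 = a11 * x1 + a12 * y1 + a13
    yc1 = a21 * x1 + a22 * y1 + a23
    xc2 = a11 * x2 + a12 * y2 + a13
    yc2 = a21 * x2 + a22 * y2 + a23
    return np.stack([xc1, yc1, xc2, yc2], axis=1)
```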
In some embodiments, the adjacent image frames are t-1 image frames and t image frames; the multi-target association tracking method is a Sort algorithm or DeepSort algorithm or ByteTrack algorithm or BoT-Sort algorithm;
assigning a target frame tracking ID of each target frame in sequence after the target frames of adjacent image frames are associated, assigning tracking IDs of the target frames in sequence from 0 according to the storage sequence of the target frames for the initial image frames, and simultaneously emptying a lost target set of the initial image frames and a newly added target set of the initial image frames;
For the t image frame, if the target frame of the t image frame is already associated with the target frame of the t-1 image frame, assigning a target frame tracking ID of the target frame of the t-1 image frame to the target frame of the t image frame; if the target frame tracking ID of the target frame of the t image frame is not associated, the target frame tracking ID of the target frame of the t image frame is assigned to be the largest target frame tracking ID added with 1, meanwhile, the target frame of the t-1 image frame is added to the lost target set of the t image frame, and the target frame of the t image frame is added to the newly added target set of the t image frame.
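A rough Python sketch of the ID bookkeeping just described, for illustration only; the dict-based data structures and the function name assign_ids are assumptions, not specified by the patent.

```python
def assign_ids(prev_tracks, cur_boxes, matches, next_id, lost_set, new_set):
    """Assign tracking IDs after adjacent-frame association.

    prev_tracks: target frames of the t-1 image frame, each a dict with an 'id'.
    cur_boxes:   target frames of the t image frame, 'id' initialised to -1.
    matches:     (prev_index, cur_index) pairs produced by the association step.
    next_id:     1 + the largest tracking ID assigned so far.
    """
    matched_prev = {i for i, _ in matches}
    matched_cur = {j for _, j in matches}

    # Associated target frames inherit the t-1 frame's tracking ID.
    for i, j in matches:
        cur_boxes[j]['id'] = prev_tracks[i]['id']

    # Unassociated t-frame target frames get a fresh ID and join the newly added target set.
    for j, box in enumerate(cur_boxes):
        if j not in matched_cur:
            box['id'] = next_id
            next_id += 1
            new_set.append(box)

    # Unassociated t-1 target frames join the lost target set.
    for i, trk in enumerate(prev_tracks):
        if i not in matched_prev:
            lost_set.append(trk)

    return next_id
```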
In some embodiments, the loss recovery strategy includes a short-time loss recovery strategy and a long-time loss recovery strategy; the S5: the multi-frame association recovery stage further comprises: a lost target set management method, a newly added target set management method, a short-time loss judgment method, a short-time loss recovery strategy and a long-time loss recovery strategy;
The lost target set management method comprises a historical lost target coordinate updating stage, a lost target adding stage and a lost target deleting stage. First, in the historical lost target coordinate updating stage, the coordinates of each lost target frame in the lost target set are predicted and the prediction result is written back to the lost target frame; the prediction comprises the following steps: let the target frame coordinates of any lost target frame in the lost target set at time t be R(tx1, ty1, tx2, ty2), and model R as a function of (deltaT, vx1, vy1, alpha, W, H) as follows:
wherein deltaT is the time difference between the prediction moment and the moment the target was lost, vx1 is the historical moving speed of the upper-left abscissa of the lost target frame, vy1 is the historical moving speed of the lower-right ordinate of the lost target frame, alpha is the expansion coefficient of the rectangular frame with a value between 0 and 1, W is the width of the image frame in which the lost target frame is located, H is the height of that image frame, and the historical moving speed is the speed at the moment of loss, the average speed, the median speed or the maximum speed. Then, in the lost target adding stage, a lost feature vector is extracted with a feature extraction method from the historical target set of each lost target frame in the lost target set of the t image frame, and the lost feature vectors of the whole historical target set are stored as a historical feature library, where the historical target set is the set of target image regions cached before the lost target frame was lost. Next, the lost target frames of the t image frame, together with their historical feature libraries, are added to the lost target set. Finally, lost target frames whose time in the lost target set exceeds the maximum existence duration of lost targets are deleted;
The new target set management method comprises a history new target coordinate updating stage, a new target stage and a new target deleting stage; firstly, in a history newly-increased target coordinate updating stage, updating newly-increased target frame coordinates in a newly-increased target set, searching the target frame coordinates of the newly-increased target frame at the latest moment in a target frame set through a target frame tracking ID of the newly-increased target frame, and updating; then, in the new target stage, extracting new feature vectors from each new target frame in the new target set of the t image frames by using a feature extraction method; then, the new added target set and the new added feature vector of the t image frame are newly added to the new added target set; finally, deleting the new target frame with the time length longer than the maximum existing time length of the new target in the new target set;
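For illustration only, the following sketch shows one possible update cycle of the newly added target set along the lines of the management method above; the dict layout {'id', 'box', 'feat', 't_added', 'is_new'} and the helper names are assumptions.

```python
def update_new_target_set(new_set, cur_boxes, extract_feat, t, max_age):
    """One update of the newly added target set (structure assumed, not from the patent).

    new_set:      list of dicts {'id', 'box', 'feat', 't_added'}.
    cur_boxes:    target frame set of the t image frame, each with 'id', 'box', 'is_new'.
    extract_feat: feature extraction function (e.g. a VGG16 embedding).
    """
    latest = {b['id']: b['box'] for b in cur_boxes}

    # 1) Historical coordinate update: refresh entries still visible at frame t,
    #    looked up through their target frame tracking ID.
    for entry in new_set:
        if entry['id'] in latest:
            entry['box'] = latest[entry['id']]

    # 2) Adding stage: extract feature vectors for this frame's newly added targets.
    for b in cur_boxes:
        if b.get('is_new'):
            new_set.append({'id': b['id'], 'box': b['box'],
                            'feat': extract_feat(b), 't_added': t})

    # 3) Deleting stage: drop entries older than the maximum existence duration.
    return [e for e in new_set if t - e['t_added'] <= max_age]
```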
The short-time loss judging method comprises the following steps: carrying out loss judgment on each lost target frame in the lost target set, marking all lost target frames with the loss time length smaller than the loss time length threshold value as short-time lost target frames, and storing the short-time lost frames into a short-time lost set; recording all lost target frames with the loss time length greater than or equal to the loss time length threshold value as long-time lost target frames and storing the long-time lost target frames into a long-time lost set;
The short-time loss recovery strategy comprises the following steps: first, judge whether the short-time lost set or the newly added target set is empty; if either is empty, no short-time lost target frame is recovered; if not, perform coordinate matching: predict the short-time lost target frame to obtain a target prediction frame, compare the coordinates of the target prediction frame with the coordinates of all newly added target frames in the newly added target set, and take the newly added target frames lying completely inside the target prediction frame as the candidate newly added target set; finally, perform feature matching: compute in turn the cosine similarity between the lost feature vector of each short-time lost target frame in the short-time lost set at the moment of loss and the newly added feature vector of each target frame in the candidate newly added target set, and recover the short-time lost target frame by the matching similarity, the formula being b = argmax_{bj∈Bc} (xa·xbj)/(‖xa‖‖xbj‖), wherein b is the target frame with the highest similarity in the candidate newly added target set, bj is a target frame in the candidate newly added target set, Bc is the candidate newly added target set, xa is the lost feature vector of the short-time lost target frame at the moment of loss, and xbj is the newly added feature vector of target frame bj; if the similarity of b is greater than or equal to the feature matching similarity threshold, the short-time lost target frame is retrieved: the target frame tracking ID of b is set to the target frame tracking ID of the short-time lost target frame, b is deleted from the newly added target set and the candidate newly added target set, and the short-time lost target frame is deleted from the lost target set; if the similarity of b is smaller than the feature matching similarity threshold, the short-time lost target frame is not retrieved;
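A minimal Python sketch of the short-time recovery for a single lost target, for illustration only; the dict keys ('feat', 'pred_box', 'id') and the similarity threshold are assumptions.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def inside(inner, outer):
    """True if box `inner` lies completely inside box `outer` (x1, y1, x2, y2)."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def recover_short_term(lost, new_set, sim_thresh):
    """Match one short-time lost target frame against the newly added target set.

    lost: dict with 'feat' (lost feature vector xa at the moment of loss),
          'pred_box' (target prediction frame) and 'id'.
    Returns the retrieved newly added target frame, or None.
    """
    # Coordinate matching: candidates must lie completely inside the prediction frame.
    candidates = [c for c in new_set if inside(c['box'], lost['pred_box'])]
    if not candidates:
        return None
    # Feature matching: b = argmax over candidates of cosine similarity.
    sims = [cosine(lost['feat'], c['feat']) for c in candidates]
    best = int(np.argmax(sims))
    if sims[best] >= sim_thresh:
        candidates[best]['id'] = lost['id']   # the retrieved target keeps the lost track's ID
        return candidates[best]
    return None
```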
The long-time loss recovery strategy comprises the following steps: first, judge whether the long-time lost set or the newly added target set is empty; if either is empty, no long-time lost target frame is retrieved; if not, perform feature library matching: using the history feature library of the long-time lost target frame al at the moment of loss, compute in turn the cosine similarity with the newly added feature vector of each target frame in the newly added target set, and retrieve the long-time lost target frame by the matching similarity, the formula being bl = argmax_{bk∈B} (xal·xbk)/(‖xal‖‖xbk‖), wherein bl is the target frame with the highest similarity in the newly added target set, bk is a target frame in the newly added target set, B is the newly added target set, al is a target frame in the long-time lost set, xal is a lost feature vector in the history feature library of the long-time lost target frame at the moment of loss, and xbk is the newly added feature vector of target frame bk; if the similarity of bl is greater than or equal to the feature matching similarity threshold, the long-time lost target frame is retrieved: the target frame tracking ID of bl is set to the target frame tracking ID of the long-time lost target frame, bl is deleted from the newly added target set, and the long-time lost target frame is deleted from the lost target set; if the similarity of bl is smaller than the feature matching similarity threshold, the long-time lost target frame is not retrieved.
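The corresponding long-time recovery can be sketched in the same style; taking the maximum similarity over the history feature library is an assumption here, since the text only states that the library is compared with each newly added feature vector.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def recover_long_term(lost, new_set, sim_thresh):
    """Match one long-time lost target frame against the newly added target set.

    lost: dict with 'feat_lib' (history feature library, a list of vectors) and 'id'.
    Returns the retrieved newly added target frame, or None.
    """
    best, best_sim = None, -1.0
    for cand in new_set:
        # Assumed aggregation: best similarity over all stored history vectors.
        sim = max(cosine(f, cand['feat']) for f in lost['feat_lib'])
        if sim > best_sim:
            best, best_sim = cand, sim
    if best is not None and best_sim >= sim_thresh:
        best['id'] = lost['id']   # the retrieved target keeps the lost track's ID
        return best
    return None
```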
In this embodiment, maintaining the lost target set and the newly added target set B greatly reduces the matching space for lost-target recovery. The short-time loss recovery strategy assumes that the appearance, shape and size of a briefly lost target change little, and makes full use of short-term multi-frame information by combining the predicted target frame coordinates with the target features at the moment of loss for matching and recovery. The long-time loss recovery strategy, because the appearance, shape and size of the target change greatly, builds a feature library from the features of historical targets and recovers lost targets based on feature library matching; meanwhile, limiting the sizes of the lost target set and the newly added target set keeps the feature matching space within a small range. The feature extraction method may be any image feature extraction method; in this embodiment, a VGG16 model is used to extract target feature vectors.
In some embodiments, the feature extraction method is the HOG method, the bag-of-words (BoW) method, a VGG neural network model, a ResNet neural network model or a NetVLAD neural network model.
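As an illustrative sketch of the VGG16-based option mentioned in this embodiment (not prescribed by the patent), assuming a recent torchvision; the 224x224 input size and the choice of the penultimate fully connected layer are assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# VGG16 backbone used as a fixed feature extractor.
_vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
_preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_feature(crop_bgr):
    """Return an L2-normalised feature vector for a target image crop (H x W x 3, BGR uint8)."""
    rgb = crop_bgr[:, :, ::-1].copy()          # BGR -> RGB
    x = _preprocess(rgb).unsqueeze(0)
    f = _vgg.features(x)
    f = _vgg.avgpool(f).flatten(1)
    f = _vgg.classifier[:4](f)                 # stop before the final classification layer
    f = torch.nn.functional.normalize(f, dim=1)
    return f.squeeze(0).numpy()
```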
A second aspect of the invention provides: a multi-target tracking system for implementing any of the multi-target tracking methods described above, comprising:
The optoelectronic device is used for carrying an optoelectronic camera and transmitting the image data to the computing device in a wired or wireless transmission mode;
The computing device is used for carrying a computing force unit and a multi-target tracking module, wherein the multi-target tracking module is used for realizing multi-target tracking of the image data to obtain a multi-target tracking result and outputting the multi-target tracking result to the result output device;
The result output device is used for outputting a multi-target tracking result in a wired transmission mode or a wireless transmission mode;
the multi-target tracking module includes: an image acquisition unit for acquiring image data from the optoelectronic device;
the target detection and camera motion estimation unit is used for carrying out target detection on the image data by using the multi-target detection model to obtain a target frame set; meanwhile, calculating a camera motion matrix M according to static region feature points of adjacent image frames in the image data;
The target motion compensation unit is used for performing motion compensation on the first target frame coordinate pH according to the target frame set and the camera motion matrix M to obtain a second target frame coordinate pC, and then assigning the second target frame coordinate pC back to the target frame;
The adjacent image frame target association unit is used for associating target frames of the adjacent image frames by using a multi-target association tracking method;
And the multi-frame association recovery unit is used for setting a lost target set and a newly added target set, and obtaining a multi-target tracking result by using a lost recovery strategy.
A third aspect of the invention provides: a computer storage medium having stored therein computer executable instructions that when loaded and executed by a processor implement any of the multi-objective tracking methods described above.
The foregoing is merely a preferred embodiment of the invention. It is to be understood that the invention is not limited to the forms disclosed herein and is not to be construed as excluding other embodiments; it may be used in various other combinations, modifications and environments, and may be modified within the scope of the inventive concept described herein, whether by the above teachings or by the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to be within the scope of the appended claims.
Claims (9)
1. A multi-target tracking method is characterized in that: the method comprises the following steps:
S1: an image acquisition stage of acquiring image data from the optoelectronic device;
s2: in the target detection and camera motion estimation stage, a multi-target detection model is used for carrying out target detection on image data to obtain a target frame set; meanwhile, calculating a camera motion matrix M according to static region feature points of adjacent image frames in the image data;
s3: a target motion compensation stage, namely performing motion compensation on the first target frame coordinate pH according to the target frame set and the camera motion matrix M to obtain a second target frame coordinate pC, and then assigning the second target frame coordinate pC back to the target frame;
s4: a target association stage of adjacent image frames, wherein a multi-target association tracking method is used for associating target frames of the adjacent image frames;
S5: setting a lost target set and a newly added target set in a multi-frame association recovery stage, and obtaining a multi-target tracking result by using a lost recovery strategy;
s6: a result output stage for outputting a multi-target tracking result by using a result output device;
The loss recovery strategy comprises a short-time loss recovery strategy and a long-time loss recovery strategy; the S5: the multi-frame association recovery stage further comprises: a lost target set management method, a newly added target set management method, a short-time loss judgment method, a short-time loss recovery strategy and a long-time loss recovery strategy;
The lost target set management method comprises a historical lost target coordinate updating stage, a lost target adding stage and a lost target deleting stage. First, in the historical lost target coordinate updating stage, the coordinates of each lost target frame in the lost target set are predicted and the prediction result is written back to the lost target frame; the prediction comprises the following steps: let the target frame coordinates of any lost target frame in the lost target set at time t be R(tx1, ty1, tx2, ty2), and model R as a function of (deltaT, vx1, vy1, alpha, W, H) as follows:
wherein deltaT is the time difference between the prediction moment and the moment the target was lost, vx1 is the historical moving speed of the upper-left abscissa of the lost target frame, vy1 is the historical moving speed of the lower-right ordinate of the lost target frame, alpha is the expansion coefficient of the rectangular frame with a value between 0 and 1, W is the width of the image frame in which the lost target frame is located, H is the height of that image frame, and the historical moving speed is the speed at the moment of loss, the average speed, the median speed or the maximum speed. Then, in the lost target adding stage, a lost feature vector is extracted with a feature extraction method from the historical target set of each lost target frame in the lost target set of the t image frame, and the lost feature vectors of the whole historical target set are stored as a historical feature library, where the historical target set is the set of target image regions cached before the lost target frame was lost. Next, the lost target frames of the t image frame, together with their historical feature libraries, are added to the lost target set. Finally, lost target frames whose time in the lost target set exceeds the maximum existence duration of lost targets are deleted;
The new target set management method comprises a history new target coordinate updating stage, a new target stage and a new target deleting stage; firstly, in a history newly-increased target coordinate updating stage, updating newly-increased target frame coordinates in a newly-increased target set, searching the target frame coordinates of the newly-increased target frame at the latest moment in a target frame set through a target frame tracking ID of the newly-increased target frame, and updating; then, in the new target stage, extracting new feature vectors from each new target frame in the new target set of the t image frames by using a feature extraction method; then, the new added target set and the new added feature vector of the t image frame are newly added to the new added target set; finally, deleting the new target frame with the time length longer than the maximum existing time length of the new target in the new target set;
The short-time loss judging method comprises the following steps: carrying out loss judgment on each lost target frame in the lost target set, marking all lost target frames with the loss time length smaller than the loss time length threshold value as short-time lost target frames, and storing the short-time lost frames into a short-time lost set; recording all lost target frames with the loss time length greater than or equal to the loss time length threshold value as long-time lost target frames and storing the long-time lost target frames into a long-time lost set;
The short-time loss recovery strategy comprises the following steps: first, judge whether the short-time lost set or the newly added target set is empty; if either is empty, no short-time lost target frame is recovered; if not, perform coordinate matching: predict the short-time lost target frame to obtain a target prediction frame, compare the coordinates of the target prediction frame with the coordinates of all newly added target frames in the newly added target set, and take the newly added target frames lying completely inside the target prediction frame as the candidate newly added target set; finally, perform feature matching: compute in turn the cosine similarity between the lost feature vector of each short-time lost target frame in the short-time lost set at the moment of loss and the newly added feature vector of each target frame in the candidate newly added target set, and recover the short-time lost target frame by the matching similarity, the formula being b = argmax_{bj∈Bc} (xa·xbj)/(‖xa‖‖xbj‖), wherein b is the target frame with the highest similarity in the candidate newly added target set, bj is a target frame in the candidate newly added target set, Bc is the candidate newly added target set, xa is the lost feature vector of the short-time lost target frame at the moment of loss, and xbj is the newly added feature vector of target frame bj; if the similarity of b is greater than or equal to the feature matching similarity threshold, the short-time lost target frame is retrieved: the target frame tracking ID of b is set to the target frame tracking ID of the short-time lost target frame, b is deleted from the newly added target set and the candidate newly added target set, and the short-time lost target frame is deleted from the lost target set; if the similarity of b is smaller than the feature matching similarity threshold, the short-time lost target frame is not retrieved;
The long-time loss recovery strategy comprises the following steps: first, judge whether the long-time lost set or the newly added target set is empty; if either is empty, no long-time lost target frame is retrieved; if not, perform feature library matching: using the history feature library of the long-time lost target frame al at the moment of loss, compute in turn the cosine similarity with the newly added feature vector of each target frame in the newly added target set, and retrieve the long-time lost target frame by the matching similarity, the formula being bl = argmax_{bk∈B} (xal·xbk)/(‖xal‖‖xbk‖), wherein bl is the target frame with the highest similarity in the newly added target set, bk is a target frame in the newly added target set, B is the newly added target set, al is a target frame in the long-time lost set, xal is a lost feature vector in the history feature library of the long-time lost target frame at the moment of loss, and xbk is the newly added feature vector of target frame bk; if the similarity of bl is greater than or equal to the feature matching similarity threshold, the long-time lost target frame is retrieved: the target frame tracking ID of bl is set to the target frame tracking ID of the long-time lost target frame, bl is deleted from the newly added target set, and the long-time lost target frame is deleted from the lost target set; if the similarity of bl is smaller than the feature matching similarity threshold, the long-time lost target frame is not retrieved.
2. The multi-target tracking method of claim 1, wherein: the multi-target detection model is a Faster R-CNN, YOLO or DETR model; the target frame set comprises a plurality of target frames; each target frame comprises target frame coordinates, a target frame category, a target frame confidence and a target frame tracking ID, wherein the target frame coordinates comprise the upper-left abscissa, upper-left ordinate, lower-right abscissa and lower-right ordinate of the target frame; and the initial value of the target frame tracking ID is -1.
3. The multi-target tracking method of claim 1, wherein: the camera motion matrix M is calculated by the following steps:
Static-region feature points are extracted, filtering out target frames whose preset category is dynamic; then the static-region feature points of adjacent image frames are matched and associated with a feature point matching method; then noise matching pairs from dynamic regions are removed with the RANSAC algorithm to obtain a static-region matching pair set; finally, the static-region matching pair set is used as the input of the OpenCV getAffineTransform() function to obtain the camera motion matrix M.
4. The multi-target tracking method according to claim 3, characterized in that: the static-region feature points are ORB, SURF, SIFT or SuperPoint features; the feature point matching method is brute-force matching, approximate nearest-neighbor matching or SuperGlue matching; the model used by the RANSAC algorithm is a homography matrix H estimation model, an essential matrix E estimation model or a fundamental matrix F estimation model; and the camera motion matrix M is a matrix of 2 rows and 3 columns.
5. The multi-target tracking method of claim 1, wherein: the motion compensation is calculated as follows:
xc1 = a11·x1 + a12·y1 + a13,  yc1 = a21·x1 + a22·y1 + a23,
xc2 = a11·x2 + a12·y2 + a13,  yc2 = a21·x2 + a22·y2 + a23,
wherein pH = (x1, y1, x2, y2) is the first target frame coordinate, M is the camera motion matrix comprising the six elements a11, a12, a13, a21, a22, a23, pC = (xc1, yc1, xc2, yc2) is the second target frame coordinate, x1 is the upper-left-corner abscissa of the first target frame, y1 is the upper-left-corner ordinate of the first target frame, x2 is the lower-right-corner abscissa of the first target frame, y2 is the lower-right-corner ordinate of the first target frame, xc1 is the upper-left-corner abscissa of the second target frame, yc1 is the upper-left-corner ordinate of the second target frame, xc2 is the lower-right-corner abscissa of the second target frame, and yc2 is the lower-right-corner ordinate of the second target frame.
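A small sketch of this compensation applied to one target frame, assuming boxes are stored as (x1, y1, x2, y2) tuples and M is any 2x3 matrix such as the one estimated above:

```python
def compensate_box(pH, M):
    """Apply the 2x3 camera motion matrix M to a box pH = (x1, y1, x2, y2)
    and return the compensated box pC = (xc1, yc1, xc2, yc2)."""
    x1, y1, x2, y2 = pH
    a11, a12, a13 = M[0]
    a21, a22, a23 = M[1]
    xc1 = a11 * x1 + a12 * y1 + a13
    yc1 = a21 * x1 + a22 * y1 + a23
    xc2 = a11 * x2 + a12 * y2 + a13
    yc2 = a21 * x2 + a22 * y2 + a23
    return (xc1, yc1, xc2, yc2)
```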
6. The multi-target tracking method of claim 1, wherein: the adjacent image frames are the (t-1)-th image frame and the t-th image frame; the multi-target association tracking method is the SORT algorithm, the DeepSORT algorithm, the ByteTrack algorithm, or the BoT-SORT algorithm;
after the target frames of adjacent image frames are associated, a target frame tracking ID is assigned to each target frame in turn; for the initial image frame, tracking IDs are assigned to the target frames in storage order starting from 0, and the lost target set and the newly added target set of the initial image frame are emptied;
for the t-th image frame, if a target frame of the t-th image frame has been associated with a target frame of the (t-1)-th image frame, the target frame tracking ID of the target frame of the (t-1)-th image frame is assigned to the target frame of the t-th image frame; if a target frame of the t-th image frame is not associated, its target frame tracking ID is assigned as the largest existing target frame tracking ID plus 1, the unassociated target frame of the (t-1)-th image frame is added to the lost target set of the t-th image frame, and the target frame of the t-th image frame is added to the newly added target set of the t-th image frame.
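The bookkeeping in claim 6 can be sketched as follows, assuming the association step has already produced a mapping from current detections to previous-frame track IDs; all names (assoc, prev_tracks, track_id, the dictionary-based target frames) are illustrative and not part of the claims.

```python
def assign_track_ids(assoc, prev_tracks, detections, next_id):
    """assoc       : dict mapping detection index -> previous-frame track ID,
                     present only for detections the association step matched
    prev_tracks   : dict of previous-frame track ID -> target frame
    detections    : list of current-frame target frames (track_id == -1 initially)
    next_id       : largest track ID assigned so far + 1
    Returns (lost_set, new_set, next_id)."""
    matched_prev = set()
    new_set = []
    for i, det in enumerate(detections):
        if i in assoc:                     # associated: inherit previous-frame ID
            det["track_id"] = assoc[i]
            matched_prev.add(assoc[i])
        else:                              # unassociated: new ID = largest ID + 1
            det["track_id"] = next_id
            next_id += 1
            new_set.append(det)            # goes to the newly added target set
    # previous-frame target frames with no match go to the lost target set
    lost_set = [t for tid, t in prev_tracks.items() if tid not in matched_prev]
    return lost_set, new_set, next_id
```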
7. The multi-target tracking method of claim 1, wherein: the feature extraction method is the HOG method, the bag-of-words (BoW) method, a VGG neural network model, a ResNet neural network model, or a NetVLAD neural network model.
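Of the listed options, the ResNet variant might look like the following torchvision-based sketch; the choice of ResNet-18, the 224x224 input size, and the normalisation constants are assumptions about a typical setup rather than values given in the patent.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-18 backbone with the classification layer removed, used as a
# feature extractor for target image patches (one of the listed options).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_feature(patch_bgr):
    """patch_bgr: HxWx3 uint8 crop of a target frame. Returns a 512-d vector."""
    rgb = patch_bgr[:, :, ::-1].copy()        # BGR -> RGB
    x = preprocess(rgb).unsqueeze(0)
    return backbone(x).squeeze(0).numpy()
```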
8. A multi-target tracking system for implementing the multi-target tracking method as claimed in any one of claims 1 to 7, characterized by comprising:
The photoelectric device is used for carrying a photoelectric camera and transmitting the image data to the computing device in a wired transmission mode or a wireless transmission mode;
The computing device is used for carrying a computing power unit and a multi-target tracking module, wherein the multi-target tracking module is used for realizing multi-target tracking of the image data to obtain a multi-target tracking result and outputting the multi-target tracking result to the result output device;
The result output device is used for outputting a multi-target tracking result in a wired transmission mode or a wireless transmission mode;
the multi-target tracking module includes: an image acquisition unit for acquiring image data from the optoelectronic device;
the target detection and camera motion estimation unit is used for carrying out target detection on the image data by using the multi-target detection model to obtain a target frame set; meanwhile, calculating a camera motion matrix M according to static region feature points of adjacent image frames in the image data;
The target motion compensation unit is used for performing motion compensation on the first target frame coordinate pH according to the target frame set and the camera motion matrix M to obtain a second target frame coordinate pC, and then assigning the second target frame coordinate pC back to the target frame;
The adjacent image frame target association unit is used for associating target frames of the adjacent image frames by using a multi-target association tracking method;
The multi-frame association recovery unit is used for setting a lost target set and a newly added target set, and obtaining a multi-target tracking result by using a lost recovery strategy;
The loss recovery strategy comprises a short-time loss recovery strategy and a long-time loss recovery strategy; the S5 multi-frame association recovery stage further comprises a lost target set management method, a newly added target set management method, a short-time loss judging method, the short-time loss recovery strategy, and the long-time loss recovery strategy;
The lost target set management method comprises a historical lost target coordinate updating stage, a lost target adding stage, and a lost target deleting stage. First, in the historical lost target coordinate updating stage, the coordinates of each lost target frame in the lost target set are predicted and the prediction result is updated to the lost target frame; the prediction comprises the following steps: let the target frame coordinates of any lost target frame in the lost target set at time t be R(tx1, ty1, tx2, ty2), and model R as a function of (deltaT, vx1, vy1, alpha, W, H),
wherein deltaT is the time difference between the prediction moment and the moment the target was lost, vx1 is the historical moving speed of the upper-left-corner abscissa of the lost target frame, vy1 is the historical moving speed of the lower-right-corner ordinate of the lost target frame, alpha is the expansion coefficient of the rectangular frame with a value between 0 and 1, W is the width of the image frame containing the lost target frame, H is the height of that image frame, and the historical moving speed is the speed at the moment of loss, or the average, median, or maximum speed; then, in the lost target adding stage, a lost feature vector is extracted with the feature extraction method from the historical target set of each lost target frame in the lost target set of the t-th image frame, and the lost feature vectors of the whole historical target set are stored as a history feature library, the historical target set being the target image regions cached before the lost target frame was lost; the lost target set of the t-th image frame and its history feature library are then added to the lost target set; finally, lost target frames whose loss duration exceeds the maximum lost-target existence duration are deleted from the lost target set;
The newly added target set management method comprises a historical newly added target coordinate updating stage, a newly added target adding stage, and a newly added target deleting stage. First, in the historical newly added target coordinate updating stage, the coordinates of the newly added target frames in the newly added target set are updated: the latest target frame coordinates of each newly added target frame are looked up in the target frame set through its target frame tracking ID and used for the update; then, in the newly added target adding stage, a newly added feature vector is extracted with the feature extraction method from each newly added target frame in the newly added target set of the t-th image frame; the newly added target set of the t-th image frame and the newly added feature vectors are then added to the newly added target set; finally, newly added target frames whose duration exceeds the maximum newly added target existence duration are deleted from the newly added target set;
The short-time loss judging method comprises the following steps: loss judgment is performed on each lost target frame in the lost target set; all lost target frames whose loss duration is smaller than the loss duration threshold are recorded as short-time lost target frames and stored in the short-time lost set; all lost target frames whose loss duration is greater than or equal to the loss duration threshold are recorded as long-time lost target frames and stored in the long-time lost set;
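The split by loss duration reduces to a simple partition, sketched here with an assumed lost_time field recorded when the target entered the lost target set:

```python
def split_lost_targets(lost_set, current_time, loss_threshold):
    """Split the lost target set into short-time and long-time lost sets
    by comparing each target's loss duration with the threshold."""
    short_term, long_term = [], []
    for target in lost_set:
        duration = current_time - target["lost_time"]
        (short_term if duration < loss_threshold else long_term).append(target)
    return short_term, long_term
```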
The short-time loss recovery strategy comprises the following steps: first, judge whether the short-time lost set or the newly added target set is empty; if either is empty, no short-time lost target frame is retrieved. If not, coordinate matching is performed: the short-time lost target frame is predicted to obtain a target prediction frame, the coordinates of the target prediction frame are compared with the coordinates of every newly added target frame in the newly added target set, and the newly added target frames lying completely inside the target prediction frame form the candidate newly added target set. Finally, feature matching is performed: the cosine similarity between the lost feature vector of each short-time lost target frame in the short-time lost set at the moment of loss and the newly added feature vector of each target frame in the candidate newly added target set is computed in turn, and the short-time lost target frame is retrieved according to the matching similarity. The formula is:
b = argmax_{b_j ∈ B_c} cos(x_a, x_{b_j}) = argmax_{b_j ∈ B_c} (x_a · x_{b_j}) / (||x_a|| ||x_{b_j}||),
wherein b is the target frame with the highest similarity in the candidate newly added target set, b_j is a target frame in the candidate newly added target set, B_c is the candidate newly added target set, x_a is the lost feature vector of the short-time lost target frame at the moment of loss, and x_{b_j} is the newly added feature vector of target frame b_j. If the similarity of b is greater than or equal to the feature matching similarity threshold, the short-time lost target frame is retrieved: the target frame tracking ID of b is set to the target frame tracking ID of the short-time lost target frame, b is deleted from the newly added target set and the candidate newly added target set, and the short-time lost target frame is deleted from the lost target set. If the similarity of b is smaller than the feature matching similarity threshold, the short-time lost target frame is not retrieved;
The long-time loss recovery strategy comprises the following steps: first, judge whether the long-time lost set or the newly added target set is empty; if either is empty, no long-time lost target frame is retrieved. If not, feature library matching is performed: the cosine similarity between the history feature library of the long-time lost target frame at the moment of loss and the newly added feature vector of each target frame in the newly added target set is computed in turn, and the long-time lost target frame is retrieved according to the matching similarity. The formula is:
b_l = argmax_{b_k ∈ B} cos(x_{a_l}, x_{b_k}) = argmax_{b_k ∈ B} (x_{a_l} · x_{b_k}) / (||x_{a_l}|| ||x_{b_k}||),
wherein b_l is the target frame with the highest similarity in the newly added target set, b_k is a target frame in the newly added target set, B is the newly added target set, a_l is a target frame in the long-time lost set, x_{a_l} is the first lost feature vector in the history feature library of the long-time lost target frame a_l at the moment of loss, and x_{b_k} is the newly added feature vector of target frame b_k. If the similarity of b_l is greater than or equal to the feature matching similarity threshold, the long-time lost target frame is retrieved: the target frame tracking ID of b_l is set to the target frame tracking ID of the long-time lost target frame, b_l is deleted from the newly added target set, and the long-time lost target frame is deleted from the lost target set. If the similarity of b_l is smaller than the feature matching similarity threshold, the long-time lost target frame is not retrieved.
9. A computer storage medium, characterized in that: the computer-readable storage medium stores computer-executable instructions which, when loaded and executed by a processor, implement the multi-target tracking method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410517072.7A CN118115755B (en) | 2024-04-28 | 2024-04-28 | Multi-target tracking method, system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118115755A CN118115755A (en) | 2024-05-31 |
CN118115755B true CN118115755B (en) | 2024-06-28 |
Family
ID=91217367
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410517072.7A Active CN118115755B (en) | 2024-04-28 | 2024-04-28 | Multi-target tracking method, system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118115755B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111612818A (en) * | 2020-05-07 | 2020-09-01 | 江苏新通达电子科技股份有限公司 | Novel binocular vision multi-target tracking method and system |
CN116433728A (en) * | 2023-03-27 | 2023-07-14 | 淮阴工学院 | DeepSORT target tracking method for shake blur scene |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11144761B2 (en) * | 2016-04-04 | 2021-10-12 | Xerox Corporation | Deep data association for online multi-class multi-object tracking |
CN114782484A (en) * | 2022-04-06 | 2022-07-22 | 上海交通大学 | Multi-target tracking method and system for detection loss and association failure |
CN116309731A (en) * | 2023-03-09 | 2023-06-23 | 江苏大学 | Multi-target dynamic tracking method based on self-adaptive Kalman filtering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108960211B (en) | Multi-target human body posture detection method and system | |
CN112288770A (en) | Video real-time multi-target detection and tracking method and device based on deep learning | |
CN110120064B (en) | Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning | |
CN110751674A (en) | Multi-target tracking method and corresponding video analysis system | |
CN111104925B (en) | Image processing method, image processing apparatus, storage medium, and electronic device | |
CN110909591A (en) | Self-adaptive non-maximum value inhibition processing method for pedestrian image detection by using coding vector | |
CN111079604A (en) | Method for quickly detecting tiny target facing large-scale remote sensing image | |
CN110992378B (en) | Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot | |
CN114694261A (en) | Video three-dimensional human body posture estimation method and system based on multi-level supervision graph convolution | |
CN113139416A (en) | Object association method, computer device, and storage medium | |
CN111950370A (en) | Dynamic environment offline visual milemeter expansion method | |
CN118115755B (en) | Multi-target tracking method, system and storage medium | |
CN113256683B (en) | Target tracking method and related equipment | |
CN103413326A (en) | Method and device for detecting feature points in Fast approximated SIFT algorithm | |
CN113963333A (en) | Traffic sign board detection method based on improved YOLOF model | |
CN116993779B (en) | Vehicle target tracking method suitable for monitoring video | |
CN111666822A (en) | Low-altitude unmanned aerial vehicle target detection method and system based on deep learning | |
CN113721240B (en) | Target association method, device, electronic equipment and storage medium | |
CN114972956A (en) | Target detection model training method, device, equipment and storage medium | |
CN111242980B (en) | Point target-oriented infrared focal plane blind pixel dynamic detection method | |
Vasekar et al. | A method based on background subtraction and Kalman Filter algorithm for object tracking | |
CN115131420B (en) | Visual SLAM method and device based on key frame optimization | |
CN117011766B (en) | Artificial intelligence detection method and system based on intra-frame differentiation | |
CN111738063B (en) | Ship target tracking method, system, computer equipment and storage medium | |
CN112396593B (en) | Closed loop detection method based on key frame selection and local features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |