CN117853759A

CN117853759A - Multi-target tracking method, system, equipment and storage medium

Info

Publication number: CN117853759A
Application number: CN202410262998.6A
Authority: CN
Inventors: 顾雪平; 阚国泽; 张洪辉; 潘晓东; 王一帆; 庞梦娇
Original assignee: Shandong Hairun Shuju Technology Co ltd
Current assignee: Shandong Hairun Shuju Technology Co ltd
Priority date: 2024-03-08
Filing date: 2024-03-08
Publication date: 2024-04-09
Anticipated expiration: 2044-03-08
Also published as: CN117853759B

Abstract

The invention belongs to the technical field of image processing, and particularly relates to a multi-target tracking method, a system, equipment and a storage medium, which are used for carrying out bounding box processing on image data in a video, further respectively detecting and correlating extracted appearance characteristics and bounding box characteristics, obtaining a preliminary tracking track through Top-K score screening, carrying out track updating by combining a constructed graph to obtain a multi-target tracking track, and improving the accuracy of multi-target tracking; and by sensing the characteristics of the image data of the adjacent past frames and the adjacent future frames of the current frame and respectively aggregating, more context information can be obtained, and the capturing capability of the continuity of the target track is improved.

Description

Multi-target tracking method, system, equipment and storage medium

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a multi-target tracking method, a system, equipment and a storage medium.

Background

In recent years, image feature analysis is performed on a captured image to realize target tracking, and the method is widely used in a multi-target tracking technique. Detection-based tracking has become a major paradigm in multi-target tracking (MOT) tasks, i.e., tracking is considered an association problem given the detection results. This tracking-detection framework allows for the incorporation of a variety of target cues into the tracking scheme. These cues include the following: first, the smoothness of the target trajectory in the time domain is exploited, which is determined by the high frame rate of the camera and the slow movement of the target. Second, the appearance characteristics of each detected object are considered, as the appearance characteristics from the same object should be similar, while the characteristics from different objects are typically different. Finally, consider interactive cues between different targets, including relationships between adjacent targets.

Currently, the studies of graph-based multi-objective tracking can be broadly divided into two aspects: on the one hand, the improvement of the cost is focused. This type of approach places an emphasis on improving edge costs using deep learning techniques. By using a twin Convolutional Neural Network (CNN) to encode reliable pairwise interactions between targets, but this approach does not take into account the critical features of object movement in the actual scene, creating a problem of correlation errors. On the other hand, emphasis is placed on the construction of the graph. Much research has been devoted to building a complex graph optimization framework that encodes the detection of higher-order dependencies between each other by combining multiple information sources. However, these methods cannot improve the target shielding and crowding conditions existing in the complex real scene, which results in the loss of the target track and affects the accuracy of multi-target tracking.

Disclosure of Invention

The invention provides a multi-target tracking method, a multi-target tracking system, multi-target tracking equipment and a storage medium.

The technical scheme of the invention is as follows:

the invention provides a multi-target tracking method, which comprises the following steps:

s1: acquiring image data of a plurality of adjacent frames in a video, and performing boundary frame processing on targets in the image data of the plurality of adjacent frames;

s2: the method comprises the steps of extracting appearance features and boundary frame features from image data under a plurality of adjacent frames after processing a boundary frame through convolution, respectively detecting and correlating the appearance features and the boundary frame features extracted from the plurality of adjacent frames to obtain a plurality of tracks, and obtaining a preliminary tracking track after the plurality of tracks are screened by Top-K scores based on the extracted appearance features and the boundary frame features;

s3: based on the preliminary tracking track, constructing a graph by taking boundary frame features as motion features and appearance features as visual features, wherein the motion features are taken as features of edges of the constructed graph, the visual features are taken as features of nodes of the constructed graph, and if the two nodes meet all the following conditions:

(1) The distance between the center coordinates of the two nodes is smaller than a preset distance;

(2) The cosine similarity between the features of the two nodes is greater than a cosine similarity threshold;

(3) The cross-over ratio of the two nodes is greater than a cross-over ratio threshold;

connecting the two nodes through edges to obtain an updated track;

s4: and updating the graph based on the updated track, respectively aggregating the characteristics of the nodes and the characteristics of the edges which are connected in the adjacent past frames and the adjacent future frames of the current frame, and then embedding the characteristics of the nodes and the characteristics of the edges of the current frame as the characteristics of the nodes and the characteristics of the edges of the updated current frame, and after adding one to the current frame, executing S1 until all frames in the video are processed, so as to obtain the multi-target tracking track.

Before the detection association in S2 of the present invention, the method further includes performing an optimization process on the boundary frame feature in the current frame image data, specifically,

four vertexes of the boundary frame characteristics in the adjacent future frame image data of the current frame are obtained respectively, four polar lines are led through the corresponding four vertexes respectively, four vertexes of the boundary frame characteristics in the adjacent future frame image data of the current frame are obtained according to a cost function, the four vertexes of the boundary frame characteristics in the adjacent future frame image data are intersected with the four polar lines respectively to obtain the prediction boundary frame characteristics in the adjacent future frame image data, and if the intersection ratio of the prediction boundary frame characteristics in the adjacent future frame image data and the boundary frame characteristics extracted from the adjacent future frame image data is larger than an intersection ratio threshold value, the boundary frame characteristics in the current frame image data and the adjacent future frame image data are optimized according to the prediction boundary frame characteristics in the adjacent future frame image data, so that the boundary frame characteristics in the optimized current frame image data are obtained for detection association.

The detection association in S2 of the present invention, specifically,

based on the extracted appearance features and the boundary frame features, if the similarity of the extracted appearance features in the adjacent frames is greater than an appearance feature similarity threshold and the intersection ratio of the extracted boundary frame features in the adjacent frames is greater than an intersection ratio threshold, respectively associating the extracted appearance features in the adjacent frames with the boundary frame features.

The step S4 of the invention is to obtain the multi-target tracking track, and further comprises the steps of carrying out edge classification on the multi-target tracking track, predicting edge scores, specifically,

based on the characteristics of edges in the multi-target tracking track, calculating the probability that targets in image data of adjacent past frames and adjacent future frames are the same target by utilizing a Hungary algorithm of an edge scoring matrix, and if the probability is larger than a preset probability threshold, reserving the multi-target tracking track in the current frame.

Before the multi-target tracking track is obtained in the step S4, the method also comprises the step of detecting the missed target in the current frame image data by adopting a single-target tracking method.

Before the multi-target tracking track is obtained in S4 of the present invention, the method further includes processing the missed targets in the continuous frame image data, specifically,

based on the extracted appearance features and the boundary box features, calculating the cost of each target in the missed target and the multi-target tracking tracks, if the cost of the missed target and a certain target in the multi-target tracking tracks is smaller than a preset cost threshold, matching the missed target with a certain target in the multi-target tracking tracks, and constraining one multi-target tracking track to be associated with one missed target at most and one missed target to be associated with one multi-target tracking track at most.

The processing of the boundary frame in the S1 comprises the steps of determining the height of the boundary frame, the width of the boundary frame, the center of the boundary frame and the frame index.

The present invention also provides a multi-target tracking system comprising:

an image preprocessing module: the method comprises the steps of acquiring image data of a plurality of adjacent frames in a video, and carrying out boundary frame processing on targets in the image data of the plurality of adjacent frames;

the preliminary tracking track generation module: the method comprises the steps of convolving image data under a plurality of adjacent frames processed by a boundary frame to extract appearance characteristics and boundary frame characteristics, respectively detecting and correlating the appearance characteristics and the boundary frame characteristics extracted from the plurality of adjacent frames to obtain a plurality of tracks, and screening the plurality of tracks by Top-K scores based on the extracted appearance characteristics and the boundary frame characteristics to obtain a preliminary tracking track;

the construction module of the graph: based on the preliminary tracking track, constructing a graph by taking boundary frame features as motion features and appearance features as visual features, wherein the motion features are taken as features of edges of the constructed graph, the visual features are taken as features of nodes of the constructed graph, and if the two nodes meet all the following conditions:

connecting the two nodes through edges to obtain an updated track;

a multi-target tracking track generation module: and updating the graph based on the updated track, respectively aggregating the characteristics of the nodes and the characteristics of the edges which are connected in the adjacent past frames and the adjacent future frames of the current frame, and then embedding the characteristics of the nodes and the characteristics of the edges of the current frame as the characteristics of the nodes and the characteristics of the edges of the updated current frame, adding one to the current frame, and entering an image preprocessing module until all frames in the video are processed, so as to obtain the multi-target tracking track.

The invention also provides multi-target tracking equipment, which comprises a processor and a memory, wherein the multi-target tracking method is realized when the processor executes the computer program stored in the memory.

The invention also provides a multi-target tracking storage medium for storing a computer program, wherein the computer program realizes the multi-target tracking method when being executed by a processor.

The beneficial effects are that: according to the method, the boundary box processing is carried out on the image data in the video, so that the extracted appearance characteristics and the boundary box characteristics are respectively detected and associated, the preliminary tracking track is obtained through Top-K score screening, the track is updated by combining with the constructed graph, the multi-target tracking track is obtained, and the accuracy of multi-target tracking is improved;

according to the invention, the polar lines are introduced to obtain the optimized position of the target boundary frame of the current frame, and the appearance characteristics and the boundary frame characteristics of the target can be considered in track generation, so that the accuracy of target association in a complex scene is improved, and the influence of camera motion is reduced through association of adjacent frames, and the overall association accuracy is improved;

according to the invention, through sensing and respectively aggregating the embedded information of the characteristics of the nodes and the edges of the adjacent past frames and the adjacent future frames of the current frame, more context information can be obtained, so that the continuity of the target track is maintained, especially in complex scenes such as occlusion, camera movement and the like, the problems of target track loss and the like caused by occlusion or camera movement in an actual scene are solved, and the capturing capability of the continuity of the target track is improved.

Drawings

Figure 1 is a flow chart of the multi-target tracking method of the present application,

fig. 2 is a schematic diagram of detection of prediction bounding box features in image data of adjacent future frames based on epipolar lines in the present application, where (a) is a first target detected by the t-th frame, (b) is a target detected by the t+1th frame, (c) is a result of intersection of the target prediction bounding box of the t+1th frame and four epipolar lines based on epipolar lines, (d) is a prediction bounding box for obtaining a target optimum based on epipolar lines,

fig. 3 is a schematic diagram of node update during message passing in a constructed diagram in different manners, where (a) is an initial setting of node update during message passing, (b) is a prior art node update, and (c) is a node update of the present application.

Detailed Description

The following examples are intended to illustrate the invention, but not to limit it further.

The invention provides a multi-target tracking method, as shown in figure 1, comprising the following steps:

connecting the two nodes through edges to obtain an updated track;

According to the method, the boundary box processing is carried out on the image data in the video, the extracted appearance characteristics and the boundary box characteristics are detected and associated respectively, the preliminary tracking track is obtained through Top-K score screening, the constructed graph is combined, track updating is carried out, the multi-target tracking track is obtained, and the accuracy of multi-target tracking is improved.

S1: and acquiring image data of a plurality of adjacent frames in the video, and carrying out boundary frame processing on targets in the image data of the plurality of adjacent frames.

To further achieve accurate tracking of the target position, the bounding box processing in S1 includes determining a height h of the bounding box _t Width w of bounding box _t Center of bounding box (x _t ，y _t ) Frame index t.

The method adopts a boundary frame to realize the preliminary positioning of the target position in the image data, wherein the track is formed by connecting a series of boundary frames of different frames, a boundary frame on the track is generated by using the t-th frame to represent a target, and the target is represented by usingRepresenting a set of objects that occurred prior to the t-th frame of image data. Each element W e W in the set W is then taken to represent the track of the target in different frames, i.e. the same targetA set of consecutive detections at different frames over a period of time. In addition, adopt D _t Representing a set of objects to be detected.

S2: and (3) extracting appearance characteristics and boundary frame characteristics from the image data of a plurality of adjacent frames processed by the boundary frame after convolution, respectively detecting and correlating the appearance characteristics and the boundary frame characteristics extracted from the plurality of adjacent frames to obtain a plurality of tracks, and screening the plurality of tracks through Top-K scores based on the extracted appearance characteristics and the boundary frame characteristics to obtain a preliminary tracking track.

The method extracts the appearance characteristic and the boundary frame characteristic of each target in the t-th frame image data at the same time, and defines the detection score D based on the appearance characteristic _score . That is, in the complete track of each object, a track having a dimension d is included _ob Is described, and the objects of the appearance features and bounding box features.

The appearance features refer to features for describing the appearance of the target, and generally include information of color, texture, shape and the like of the target.

The boundary box characteristics refer to boundary box parameters of each detection result, wherein the boundary box parameters comprise height and width.

The detection score, which may be considered a measure of the appearance characteristics of the object, represents the salience or confidence of the object in the image.

In addition, it should be noted that, due to unreliable detection results, the complete track of an object may be divided into track segments.

In addition, before the detection association, the method further comprises the step of optimizing the boundary box characteristics in the current frame image data. Considering that rapid camera movement may affect the accuracy of target tracking, the present application assumes that the target is moving slowly or stationary, and predicts the predicted bounding box features in the image data of adjacent future frames by first introducing epipolar lines.

Specifically, four vertexes of boundary frame characteristics in adjacent past frames of the current frame and boundary frame characteristics in current frame image data are respectively obtained, four polar lines are respectively led through the corresponding four vertexes, four vertexes of boundary frame characteristics in adjacent future frame image data of the current frame are obtained according to a cost function, the four vertexes of boundary frame characteristics in the adjacent future frame image data are respectively intersected with the four polar lines to obtain prediction boundary frame characteristics in the adjacent future frame image data, if the intersection ratio of the prediction boundary frame characteristics in the adjacent future frame image data and the boundary frame characteristics extracted in the adjacent future frame image data is larger than an intersection ratio threshold value, the boundary frame characteristics in the current frame image data and the boundary frame characteristics in the adjacent future frame image data are optimized according to the prediction boundary frame characteristics in the adjacent future frame image data, and the boundary frame characteristics in the current frame image data are obtained for detection association.

For example, four vertices of the target bounding box in the t-th frame are defined as J _i,t Where i ε {1,2,3,4}. Similarly, will J _i,t+1 I e {1,2,3,4} is defined as the bounding box in the t+1st frame. Then define a cost function：

Wherein,ensuring that the predicted t+1st frame target bounding box intersects as much as possible with the four corresponding epipolar lines,/->Target size constraints are used to ensure that the predicted t+1st frame target bounding box aligns with the true position of the t+1st frame target as much as possible. The accuracy of the predicted t+1st frame target bounding box position can be ensured using a cost function.

Furthermore, the boundary frame feature in the current frame image data is optimized by matching feature points (SURF points) in the t-th frame image data and the t+1-th frame image data by using the RANSAC algorithm through the base matrix η.

Among them, SURF (Speeded Up Robust Features) is an algorithm for feature points in computer vision. When the RANSAC algorithm is used to match SURF points between two consecutive frames to estimate the basis matrix, this means that the feature points extracted by the SURF algorithm are used for image-to-image matching. This matching can be used to calculate a basis matrix between the two images, thereby enabling relative positioning or motion estimation between the images. The RANSAC algorithm can help to exclude some false matches and improve the accuracy of the matches.

As shown in FIG. 2, where (a) is the first object detected for the t-th frame, X _1,t Representing the position of the upper left corner of the target bounding box detected in the t-th frame, X _2,t Representing the position of the upper right corner of the target bounding box detected in the t-th frame, X _3,t Representing the position of the lower right corner of the target bounding box detected in the t-th frame, X _4,t Representing the position at which the lower left corner of the target bounding box was detected in the t-th frame. (b) For the target detected by the t+1st frame, the dashed box represents (a) the predicted position of the first target detected by the t frame in the t+1st frame, and the two solid border boxes represent the actual positions of the two targets detected by the t+1st frame.

As can be seen from (a) and (b), in the t+1st frame, the first target detected in the t frame detects that the predicted position (dashed box) of the target has a larger intersection ratio (IoU) with the actual position (right solid border box) of the other target, which indicates that the two different targets have higher similarity or overlap, while the predicted position (dashed box) of the target does not have good overlap with the actual position (left solid border box) of the target, which indicates that the tracking method is not accurate when no polar line is introduced, and is easy to be associated with errors.

IoU (cross-over ratio) is an indicator of how well two bounding boxes overlap. If IoU is larger, the larger the overlapping portion of the two bounding boxes is illustrated, i.e., the higher the similarity between the objects.

The tracking method of the lead-in line is that firstly, if the target is assumed to be stationary or slow to move, the boundary box of the target is at the four vertexes X of the t-th frame _i,t Should be located on the corresponding pole line of the t+1st frame, i.e. at the tthe target prediction bounding box of the t+1 frame should intersect as much as possible with the four epipolar lines as shown in (c) of fig. 2. After epipolar introduction, i.e., in frame t+1, the target prediction bounding box position (white bounding box in fig. 2 (d)) overlaps as much as possible with the actual bounding box position of the target.

Second, assuming that the size of the bounding box does not vary much between adjacent frames, a target-optimal prediction bounding box can be obtained, as shown by the dark bounding box in fig. 2 (d). Wherein X in FIG. 2 (d) _1,t+1 X represents the position of the upper left corner of the target-optimal prediction bounding box in the t+1st frame _2,t+1 X represents the position of the upper right corner of the target-optimal prediction bounding box in the t+1st frame _3,t+1 X represents the position of the bottom right corner of the target-optimal prediction bounding box in the t+1st frame _4,t+1 Representing the position of the bottom left corner of the target-optimal prediction bounding box in the t+1 frame.

According to the method and the device, the polar lines are introduced to obtain the optimized position of the target boundary frame of the current frame, and the appearance characteristics and the boundary frame characteristics of the target can be considered in track generation, so that the accuracy of target association in a complex scene is improved, the influence of camera motion is reduced through association of adjacent frames, and the overall association accuracy is improved.

In addition, in order to make track generation simpler and more convenient, based on the extracted appearance features and the boundary frame features, if the similarity of the extracted appearance features in the adjacent frames is greater than an appearance feature similarity threshold value and the intersection ratio of the extracted boundary frame features in the adjacent frames is greater than an intersection ratio threshold value, the extracted appearance features in the adjacent frames and the boundary frame features are respectively associated, so that the association error is ensured to be as small as possible.

Further, the threshold settings are sensitive to the distribution of detection scores, resulting in the need for calibration from different data sets and detectors, considering the use of the threshold to screen the trajectory.

Top-K score detection can compensate for these missed detections by selecting detection results with high scores when certain targets are not detected correctly. Thus, even if the detector does not completely cover all targets, the multi-target tracking method provided by the present application still has an opportunity to capture missed targets.

According to the method, the Top-K score screening is carried out on a plurality of tracks, and the detection result with high score is selected, so that the condition of missing detection of target detection is compensated.

the two nodes are connected by an edge to obtain an updated track.

For the track generation process, a graph model is defined, video data is converted into a graph, the track of each target is regarded as a node, and edges are generated by association of two nodes.

Specifically, the drawings are defined asAnd the motion features and the visual features are respectively taken as feature sets of the edge (E) and the node (V), and edge embedding and node embedding are respectively generated for each edge and each node. Thus for each nodeAll have a node embedded value +.>For each edge->All have an edge embedded value +.>。

Setting the trajectories of different targets as different nodes (e.g. N _oi And N _oj ) And only when N _oi And N _oj And the connection is performed when the corresponding condition is satisfied. In particular the number of the elements,and N _oj The connection of (2) needs to satisfy 3 conditions: (1) The distance between the center coordinates of the two nodes is smaller than a preset distance; (2) The cosine similarity between the features of the two nodes is greater than a cosine similarity threshold; (3) the cross-over ratio of the two nodes is greater than a cross-over ratio threshold. For each of the above conditions, a given N is selected _oj Number of (5) to be in charge of->And are connected without repeated connections. Since the connection between nodes is bi-directional, N _oi And N _oj The features are updated.

In addition, tracking is not possible in a short period of time because some tracks may become invisible because the target is completely occluded. These temporarily lost tracks are stored inIs then added to N in the graph _oi Is a kind of medium. In this process, the storage time is greater than +.>To prevent false positive conditions. Said->Maximum time indicating that the track is considered invisible, +.>Representing the shortest time that the track is considered invisible.

Then, the present application also introduces a binary variable for each edge in the graph. In the classical minimum component flow formula, the label of an edge connecting nodes that meet the following conditions simultaneously is defined as 1, provided that: (i) Satisfying three conditions (1) (2) (3) that the two nodes are connected; (ii) continuous in time within the track. The labels on all the remaining sides are defined as 0.

Specifically, the trajectory w _i Equivalent set of edgesCorresponding to the paths in the constructed graph arranged in sequence. Defining labels of edges based on the result, i.e. +.>Defining a binary variable +.>：

When (when)When in use, the side is->Is considered active. />For this, it is assumed that the trajectories in W are disjoint nodes, i.e. one node cannot belong to more than one trajectory. Therefore, y must satisfy a set of linear constraints, i.e. +.>The method comprises the following steps:

the inequality above shows that each node is connected to at most one node in the graph by an active edge and at most one node in the future trajectory graph, thereby completing the construction of the graph and obtaining an updated trajectory.

S4: and updating the graph based on the updated track, respectively aggregating the characteristics of the nodes connected in the adjacent past frames and the adjacent future frames of the current frame and the characteristics of the edges, embedding the characteristics of the nodes and the characteristics of the edges of the current frame as the characteristics of the nodes updated by the current frame and the characteristics of the edges, and executing S1 after adding one to the current frame until all frames in the video are processed, so as to obtain the multi-target tracking track.

The present application is based on a Message Passing Network (MPN) to achieve the overall viewAnd propagates and updates information contained in the characteristics of the edges and the characteristics of the nodes. The propagation process is divided into embedded updates of nodes and embedded updates of edges, called messaging steps. Wherein each messaging step is further divided into two update procedures: one is an update from edge to nodeThe other is update from node to edge +.>. The above updates are all performed sequentially, with the number of iterations S being fixed.

It is considered that in the actual updating process, after performing S iterations, each node contains information of all other nodes with a distance S in the graph. In the course of node and edge updates, it is also allowed to compare each node with its neighbors and aggregate information from all neighbors in order to update its embedded information to obtain more context information.

However, taking into account the linear constraints above, the constraints determine that each node in the graph can be connected to at most one node in the graph and another node in the future trajectory graph. Thus, aggregating embedded information for all neighboring nodes at once may make it difficult for updated node characteristics to capture whether these constraints are violated.

Thus, the present application breaks the aggregate into two parts to create a time-aware update rule: one is a future node and the other is a past node.

Specifically, use ofAnd->To represent nodes +.>Neighbor nodes in the t-1 st frame and the t+1 th frame. On the basis of this, two different perceptual functions are defined, namely +.>And->A perceptual function for the t+1st frame and a perceptual function for the t-1 st frame, respectively. In the iteration of the message passing step for s times, every node is +.>First calculate all its neighbors +.>Edge-to-node embedding of the t-1 st and t+1 th frames as follows:

wherein,for neighbor +.>Edge-to-node embedding->For initial embedded value, ++>Feature embedding for node of the s-1 th iteration,/->The feature embedding of the edge for the s-1 iteration ensures that the original features are not forgotten during the message passing process.

Then, the characteristics of the nodes and the characteristics of the edges of the t-1 frame and the t+1 frame are respectively aggregated and embedded into the preliminary tracking track, wherein an aggregation formula is as follows:

wherein,an aggregate embedded value for the t+1st frame, ">For the aggregate embedded value of the t-1 frame,for neighbor +.>Is embedded into the node.

Finally, updating the preliminary tracking track by combining a formula:

wherein,feature embedding for node of the s-th iteration, < >>As a learnable function。

By sensing the embedded information of the characteristics of the nodes and the edges of the adjacent past frames and the adjacent future frames of the current frame and respectively aggregating, more context information can be obtained, the continuity of the target track can be maintained, especially in complex scenes such as occlusion, camera movement and the like, the problems of target track loss and the like caused by occlusion or camera movement in an actual scene are solved, and the capturing capacity of the continuity of the target track is improved.

Fig. 3 is a schematic diagram of node updates during message delivery in a different manner in a constructed diagram. The arrow direction indicates the time direction, and the time is divided into t-1 frame, t frame, t+1 frame, and hasAnd->. Numerals 1-5 represent different nodes under different frames, wherein numeral 3 is a node of the t frame, numerals 1 and 2 are different neighbor nodes of the node of the t frame at the t-1 frame, and numerals 4 and 5 are different neighbor nodes of the node of the t frame at the t+1 frame. In addition, pentagonal boxes represent embedded information of neighbor nodes. The center plus circle pattern represents an aggregation of the different embedded information. Diamond represents a multi-layer sensor.

Fig. 3 (a) is an initial setting of node update in the message passing process, which means that only the embedded information of the neighbor node of the t-th frame is considered. (b) Updating for the prior art node represents the one-time aggregation of embedded information of all neighboring nodes. (c) For the node update of the present application, the embeddings representing the past and future frames are aggregated separately, then concatenated and input into the multi-layer perceptron to obtain a new node embedment.

In order to achieve multi-target tracking, the step S4 further includes processing the missed target in the image data of the t frame and the missed target in the image data of the continuous frame before obtaining the multi-target tracking track aiming at the target miss phenomenon.

Wherein, for the current frame imageThe missing targets in the image data are detected by adopting a single target tracking method to recover the missing targets in the image data of the t frame and the missing targets are combined with the target with high detection score D _score The bounding boxes restored by the single target tracking strategy.

While processing missed objects in successive frames of image data, a detection recovery strategy is proposed that utilizes a linear motion model to recover those missed objects. In particular to a special-shaped ceramic tile,

Suppose that if a target appears in the t-1 frame, the target is a normal target. Otherwise, the target is a missing target. The method willThe ith object in (a) is denoted as o _i D is to _t The j-th detection in (a) is denoted as d _j 。/>Representing a set of objects occurring before the t-th frame of image data, D _t Representing a set of objects to be detected. d, d _j And o _i The allocation status between them is denoted as a _i,j Wherein->Representing object o _i And detect d _j Associated, but->The opposite is indicated. Distribution and integration use->Representation, where |D _t I indicates the number of targets to be detected, < +.>Representing the original number of targets. The optimal allocation set may be represented as follows:

wherein,representing the optimal allocation set, +.>For the time minimum considered when storing the trajectory, to prevent false positive situations, σ represents a hyper-parameter, ++>And->Respectively represent the object o _i And detecting d _j Is characterized by the appearance of the (c) in terms of,representing object o _i And detecting d _j Cost of the two, the cost is the target o _i And detecting d _j Matching costs between the targets o _i And detect d _j Similarity or degree of matching between the two.

In addition, in the target detection recovery process, one detection can be associated with one target at most, and one target can be associated with one detection at most. The specific constraint formula is as follows:

meanwhile, according to the constraint formula, the following is allowed to exist:and->I.e. to detect the absence of a target of the current frame independent of the detected target.

In order to further accurately obtain the tracked target track, edge classification processing is performed on the multi-target tracking track. The step S4 of obtaining the multi-target tracking track further comprises performing edge classification on the multi-target tracking track, predicting an edge score, specifically,

Due to the nodeAt N _oj A number of nodes are connected, and the best matching is performed by using the hungarian algorithm based on the edge scoring matrix. Thus (S)>There is only one best matching edge score.

The edge score is used to evaluate the probability of whether the connected tracks belong to the same object within the time span. Such a prediction of scores facilitates trajectory matching in the graph.

The present application also provides a multi-target tracking system comprising:

connecting the two nodes through edges to obtain an updated track;

The application also provides multi-target tracking equipment, which comprises a processor and a memory, wherein the multi-target tracking method is realized when the processor executes the computer program stored in the memory.

The present application also provides a multi-target tracking storage medium for storing a computer program, wherein the computer program implements the multi-target tracking method when executed by a processor.

Claims

1. A multi-target tracking method, comprising the steps of:

connecting the two nodes through edges to obtain an updated track;

2. The multi-object tracking method according to claim 1, wherein the step of performing the detection association in S2 further comprises performing an optimization process on the bounding box features in the current frame image data, specifically,

3. The multi-target tracking method according to claim 1, wherein the detection correlation in S2 is, in particular,

4. The multi-target tracking method according to claim 1, wherein the obtaining the multi-target tracking trajectory in S4 further comprises edge classification of the multi-target tracking trajectory, prediction of an edge score, in particular,

5. The multi-target tracking method according to claim 1, wherein the step S4 further comprises detecting the missed target in the current frame image data by a single target tracking method before the multi-target tracking track is obtained.

6. The multi-target tracking method according to claim 1, wherein the step of S4 further comprises processing the missed target in the continuous frame image data, in particular,

7. The multi-object tracking method of claim 1 wherein the bounding box processing in S1 includes determining a height of the bounding box, a width of the bounding box, a center of the bounding box, a frame index.

8. A multi-target tracking system, comprising:

connecting the two nodes through edges to obtain an updated track;

9. A multi-target tracking device comprising a processor and a memory, wherein the processor implements the multi-target tracking method of any of claims 1-7 when executing a computer program stored in the memory.

10. A multi-target tracking storage medium storing a computer program, wherein the computer program when executed by a processor implements the multi-target tracking method of any of claims 1-7.