CN116681728A - Multi-target tracking method and system based on Transformer and graph embedding - Google Patents
Info
- Publication number
- CN116681728A CN116681728A CN202310686952.2A CN202310686952A CN116681728A CN 116681728 A CN116681728 A CN 116681728A CN 202310686952 A CN202310686952 A CN 202310686952A CN 116681728 A CN116681728 A CN 116681728A
- Authority
- CN
- China
- Prior art keywords
- frame
- target
- feature
- tracking
- transformer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 63
- 239000011159 matrix material Substances 0.000 claims abstract description 80
- 238000001514 detection method Methods 0.000 claims description 68
- 238000012549 training Methods 0.000 claims description 47
- 230000000007 visual effect Effects 0.000 claims description 43
- 238000013528 artificial neural network Methods 0.000 claims description 33
- 239000013598 vector Substances 0.000 claims description 25
- 230000008569 process Effects 0.000 claims description 23
- 238000012360 testing method Methods 0.000 claims description 17
- 238000000605 extraction Methods 0.000 claims description 15
- 230000007246 mechanism Effects 0.000 claims description 12
- 238000010606 normalization Methods 0.000 claims description 11
- 238000012545 processing Methods 0.000 claims description 6
- 230000002776 aggregation Effects 0.000 claims description 4
- 238000004220 aggregation Methods 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 abstract description 4
- 230000006870 function Effects 0.000 description 20
- 238000004422 calculation algorithm Methods 0.000 description 14
- 230000008901 benefit Effects 0.000 description 13
- 238000005457 optimization Methods 0.000 description 6
- 230000007547 defect Effects 0.000 description 5
- 238000005259 measurement Methods 0.000 description 4
- 239000000523 sample Substances 0.000 description 3
- 238000012546 transfer Methods 0.000 description 3
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 238000002372 labelling Methods 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30241—Trajectory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The invention belongs to the technical field of pattern recognition, and more particularly relates to a multi-target tracking method and system based on a Transformer and graph embedding.
Background Art
Multi-object tracking (MOT) is a hot research topic in computer vision, with wide applications in autonomous driving, video surveillance, intelligent robotics, and other fields. Given a video sequence, multi-object tracking aims to locate the targets of interest and recover the motion trajectory of each identity, and therefore involves two tasks: object detection and identity association.
At present, with the rise of the Transformer network, Transformer-based multi-object tracking methods have been widely adopted. Such methods use the sparse object queries of the Detection Transformer (DETR) architecture to detect new objects and initialize tracks, and use track queries to carry the information of each trajectory across frames, thereby associating trajectories over multiple frames. The common idea behind these works is to extend existing trajectories frame by frame using the object query mechanism of DETR, realizing the propagation of targets in the temporal domain.
However, the above existing Transformer-based multi-object tracking methods have several defects that cannot be ignored:
First, the success of the Transformer model in the visual domain is attributed to its global correlation modeling based on the attention mechanism, but this also leads to an inherent drawback of the Transformer architecture itself: global correlation learning not only increases computational complexity but also introduces a large amount of redundant computation.
Second, the integrated Transformer tracking model originates from the detection task and emphasizes modeling scene structure correlations over long spatio-temporal spans. The model learns without constraints, the learning process lacks interpretability, and there is no explicit priority for learning local neighborhood information.
Third, the integrated Transformer MOT framework associates tracked targets with detection responses indirectly through query ordering. The association output of a query target does not take into account the similarity between other query targets and the detections; in other words, the test stage has no explicit optimal inference for multi-target data association, so the association results are often suboptimal. This causes distortion of target information in difficult and complex scenes, makes the output based on ranked encoded queries unreliable, and in turn leads to a poor Identity Switches (IDS) metric during tracking.
Summary of the Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a multi-target tracking method and system based on a Transformer and graph embedding. Its purpose is to solve the technical problems of existing Transformer-based multi-target tracking methods: the high computational complexity and excessive redundant computation of global correlation learning, the lack of interpretability of the learning process and the absence of an explicit priority for learning local neighborhood information, and the relatively poor IDS metric during tracking.
To achieve the above object, according to one aspect of the present invention, a multi-target tracking method based on a Transformer and graph embedding is provided, comprising:
(1) acquiring a video sequence, and reading the video sequence frame by frame to obtain all of its frames;
(2) setting a counter cnt1 = 1;
(3) judging whether cnt1 is equal to the total number of frames in the video sequence; if so, the process ends, otherwise proceed to step (4);
(4) inputting the cnt1-th frame and the (cnt1+1)-th frame of the video sequence into a pre-trained multi-target tracking model to obtain an assignment matrix between the targets in the cnt1-th frame and the corresponding targets in the (cnt1+1)-th frame, wherein the cnt1-th frame is the previous frame and the (cnt1+1)-th frame is the next frame;
(5) obtaining, from the assignment matrix between the targets in the cnt1-th frame and the corresponding targets in the (cnt1+1)-th frame obtained in step (4), all targets in the (cnt1+1)-th frame associated with each target in the cnt1-th frame, and forming the tracking trajectory of that target from the target in the cnt1-th frame together with all the obtained targets;
(6) setting the counter cnt1 = cnt1 + 1, and returning to step (3); a minimal sketch of this frame-pair loop is given immediately after these steps.
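A minimal sketch of the frame-pair loop of steps (1) to (6), assuming a hypothetical pretrained model that returns the assignment matrix between two consecutive frames and a row-wise argmax reading of that matrix; all names and interfaces are illustrative, not the patent's reference implementation.

```python
import torch

def track_video(frames, model, device="cpu"):
    """Frame-by-frame tracking loop: feed consecutive frame pairs to a
    pretrained model and chain identities via the returned assignment matrix.
    `model` is a hypothetical module returning an (N_prev x N_next) matrix."""
    tracks = {}          # track_id -> list of (frame_index, target_index)
    next_track_id = 0
    prev_ids = {}        # target index in previous frame -> track_id

    for cnt1 in range(len(frames) - 1):              # stop at the last frame pair
        prev_frame = frames[cnt1].to(device)
        next_frame = frames[cnt1 + 1].to(device)
        with torch.no_grad():
            assign = model(prev_frame, next_frame)   # shape (N_prev, N_next)

        new_prev_ids = {}
        for i in range(assign.shape[0]):
            j = int(assign[i].argmax())              # target i is linked to target j
            tid = prev_ids.get(i)
            if tid is None:                          # start a new trajectory
                tid = next_track_id
                next_track_id += 1
                tracks[tid] = [(cnt1, i)]
            tracks[tid].append((cnt1 + 1, j))
            new_prev_ids[j] = tid                    # carry the identity forward
        prev_ids = new_prev_ids
    return tracks
```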
Preferably, the multi-target tracking model comprises, connected in sequence, a baseline deep visual feature extraction network, a Transformer encoder, a Transformer decoder, a graph convolutional network (GCN), and an association network.
Preferably, the first layer is the baseline deep visual feature extraction network, which comprises 1 convolutional layer, 16 building-block structures, and 1 fully connected layer. Its input is the previous frame and the next frame, and its output is a baseline deep visual feature of dimension h × w × C corresponding to the previous frame and a baseline deep visual feature of dimension h × w × C corresponding to the next frame, where h is the height of the baseline deep visual feature, w is its width, and C is its number of channels, with C = 2048.
The second layer is the Transformer encoder. Its input is the baseline deep visual features of the previous frame and the next frame obtained by the baseline deep visual feature extraction network. After L layers of Transformer encoding, it outputs a 256-dimensional global feature corresponding to the previous frame and a 256-dimensional global feature corresponding to the next frame, where L ranges from 3 to 10.
The third layer is the Transformer decoder. For the previous frame, its input is the object queries and the global feature obtained by the Transformer encoder; N output embeddings are first obtained through the cross-attention mechanism, and then N box coordinates and class labels are obtained through a feed-forward neural network. For the next frame, its input is the concatenation of the object queries with the N output embeddings obtained from the previous frame, together with the global feature obtained by the second layer; M output embeddings are first obtained through the cross-attention mechanism, and then M box coordinates and class labels are obtained through the feed-forward neural network.
The fourth layer is the graph convolutional network. Its input is the N output embeddings of the previous frame and the M output embeddings of the next frame obtained by the third layer. It first determines the neighbor relations to obtain the concatenated output embeddings of the previous frame and of the next frame, then iteratively processes them through the graph convolutional network, and outputs the local-correlation-enhanced features of the previous frame and of the next frame.
The fifth layer is the association network. Its input is the local-correlation-enhanced features of the previous frame and of the next frame obtained by the fourth layer; it performs feature-based association processing on the two to obtain the assignment matrix between the corresponding targets of the previous frame and the next frame.
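A sketch of how the five sub-networks described above could be composed into one model; every module here is an opaque placeholder with an assumed interface, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class TransformerGraphMOT(nn.Module):
    """Five-stage pipeline: backbone -> Transformer encoder -> Transformer
    decoder -> GCN -> association network (interfaces are assumptions)."""
    def __init__(self, backbone, encoder, decoder, gcn, assoc_net):
        super().__init__()
        self.backbone = backbone      # ResNet-50-style feature extractor
        self.encoder = encoder        # L-layer Transformer encoder
        self.decoder = decoder        # query-based Transformer decoder
        self.gcn = gcn                # graph convolutional refinement
        self.assoc_net = assoc_net    # attention-based association head

    def forward(self, prev_frame, next_frame, object_queries):
        f1, f2 = self.backbone(prev_frame), self.backbone(next_frame)   # h x w x C features
        g1, g2 = self.encoder(f1), self.encoder(f2)                     # global features
        emb1, boxes1, cls1 = self.decoder(object_queries, g1)           # N track outputs
        track_queries = torch.cat([object_queries, emb1], dim=0)        # concatenated queries
        emb2, boxes2, cls2 = self.decoder(track_queries, g2)            # M detections
        e1 = self.gcn(emb1, boxes1)                                     # local-correlation
        e2 = self.gcn(emb2, boxes2)                                     #   enhanced features
        return self.assoc_net(e1, e2)                                   # (N x M) assignment
```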
Preferably, the multi-target tracking model is obtained by training through the following steps:
(4-1) acquiring the MOT17 dataset and the CrowdHuman image database, dividing the acquired MOT17 dataset equally into a first half and a second half, using the first half of the MOT17 dataset together with the entire CrowdHuman image database as the training set, and using the second half of the MOT17 dataset as the test set;
(4-2) setting a counter cnt3 = 1;
(4-3) setting a counter cnt2 = 1, and judging whether cnt3 is equal to the total number of video sequences in the training set; if so, proceed to step (4-16), otherwise proceed to step (4-4);
(4-4) judging whether cnt2 is equal to the total number of frames in the cnt3-th video sequence of the training set; if so, proceed to step (4-14), otherwise proceed to step (4-5);
(4-5) initializing Nobj learnable object queries as the object queries of the previous frame and the next frame respectively, where Nobj = 500;
(4-6) taking the cnt2-th frame of the cnt3-th video sequence in the training set as the previous frame and the (cnt2+1)-th frame as the next frame, and inputting the previous frame and the next frame into the baseline deep visual feature extraction network respectively, to obtain a baseline deep visual feature I1 of dimension 2048 corresponding to the previous frame and a baseline deep visual feature I2 of dimension 2048 corresponding to the next frame;
(4-7) for the previous frame and the next frame of the cnt3-th video sequence in the training set acquired in step (4-6), using the baseline deep visual feature I1 corresponding to the previous frame and the baseline deep visual feature I2 corresponding to the next frame obtained in step (4-6) as the query, key, and value of the Transformer encoder, inputting them into the multi-head self-attention layer of the Transformer encoder to obtain 256-dimensional feature vectors, applying layer normalization to the 256-dimensional feature vectors to obtain normalized feature vectors, and feeding the normalized features into the feed-forward neural network of the Transformer encoder followed by normalization, to obtain a 512-dimensional global feature F1 corresponding to the previous frame and a 512-dimensional global feature F2 corresponding to the next frame;
(4-8) for the previous frame of the cnt3-th video sequence in the training set acquired in step (4-6), inputting the object queries of the previous frame into the multi-head self-attention layer and the layer normalization layer to obtain 256-dimensional object queries; using the global feature F1 corresponding to the previous frame obtained in step (4-7) as the key and value of the Transformer decoder and the 256-dimensional object queries as the query, inputting them into the multi-head cross-attention layer of the Transformer decoder followed by normalization to obtain N output embeddings of the previous frame; feeding the output embeddings into a feed-forward neural network followed by normalization to obtain 512-dimensional feature vectors; and inputting the 512-dimensional feature vectors into a feed-forward neural network to obtain N tracking trajectories of the previous frame, each tracking trajectory comprising box coordinates and a class label;
(4-9) for the next frame of the cnt3-th video sequence in the training set acquired in step (4-6), first concatenating the track queries and the object queries of the next frame and inputting the result into the multi-head self-attention layer and the layer normalization layer to obtain a 256-dimensional concatenated query; using the concatenated query as the query and the global feature F2 obtained in step (4-7) as the key and value, inputting them into the multi-head cross-attention layer of the Transformer decoder followed by normalization to obtain M output embeddings of the next frame; feeding the output embeddings into a feed-forward neural network followed by normalization to obtain 512-dimensional feature vectors; and inputting the 512-dimensional feature vectors into a feed-forward neural network to obtain M detected targets of the next frame, each detected target comprising box coordinates and a class label;
(4-10) establishing graph models G1 and G2 for the N tracking trajectories obtained in step (4-8) and the M detected targets obtained in step (4-9), respectively;
(4-11) inputting the graph models G1 and G2 of the previous frame and the next frame obtained in step (4-10) into the graph convolutional network respectively, and, as the number of iterations of the graph convolutional network increases, obtaining the local-correlation-enhanced feature E1 of the previous frame and the local-correlation-enhanced feature E2 of the next frame;
(4-12) inputting the local-correlation-enhanced feature E1 of the previous frame and the local-correlation-enhanced feature E2 of the next frame obtained in step (4-11) into the association network (Fig. 8), and performing feature-based association processing on them to obtain the assignment matrix between the tracking trajectories of the previous frame and the detected targets of the next frame;
(4-13) training the Transformer encoder, the Transformer decoder, the graph convolutional network, and the association network using the N box coordinates and class labels obtained in step (4-8) and the M box coordinates and class labels obtained in step (4-9): for the previous frame and next frame acquired in step (4-6), obtaining from the Transformer encoder, Transformer decoder, and GCN the box coordinates, class labels, and local-correlation-enhanced features of the tracking trajectories of the previous frame and of the detected targets of the next frame, obtaining from the association network the assignment matrix between the tracking trajectories of the previous frame and the detected targets of the next frame, and inputting the obtained box coordinates, class labels, and assignment matrix into the defined loss functions to obtain the detection loss, the tracking loss, and the focal loss;
(4-14) setting cnt2 = cnt2 + 1, and returning to step (4-4);
(4-15) setting cnt3 = cnt3 + 1, and returning to step (4-3);
(4-16) iteratively training the multi-target tracking model by back-propagation using the detection loss, tracking loss, and focal loss obtained in step (4-13) until the multi-target tracking model converges, thereby obtaining a preliminarily trained multi-target tracking model;
(4-17) verifying the preliminarily trained multi-target tracking model of step (4-16) using the test set acquired in step (4-1) until the obtained tracking accuracy is optimal, thereby obtaining the trained multi-target tracking model.
Preferably, step (4-10) is specifically as follows: first, the vertex sets are determined; the vertex set V1 = (v1, v2, ..., vN) consists of all tracking trajectories in the previous frame, and the vertex set V2 = (v1, v2, ..., vM) consists of all detected targets in the next frame, with the vertex values being the output embeddings of the tracking trajectories of the previous frame and of the detected targets of the next frame, respectively; then the neighbor nodes of all vertices in V1 and V2 are determined to form the edges, finally yielding the graph models G1 and G2.
Preferably, in step (4-10) the process of determining the neighbor nodes of all vertices in V1 and V2 is as follows: first, the box coordinates and output embeddings corresponding to all vertices in V1 are obtained; then, the intersection-over-union (IoU) between the box coordinates of each pair of vertices and the cosine similarity between the features of each pair of vertices are computed; for each vertex, the K vertices other than itself with the largest weight clustering coefficient are selected as its neighbor nodes to form edges, and the clustering coefficient is used as the weight of the edge between the vertex and its neighbor node; the neighbor nodes of all vertices in V2 are determined in the same way as for V1.
The weight of an edge between two vertices is the clustering coefficient, which represents the correlation between the vertices and is defined as the similarity of vertex i and vertex j in terms of both features and spatial distance; it is computed from the cosine similarity between the two vertex features and the IoU between the two box coordinates.
Here cos(·,·) denotes the cosine similarity between feature vectors, IoU(·,·) denotes the intersection-over-union of the position coordinates of two targets, l denotes the GCN iteration layer, the feature of vertex i at the l-th GCN layer is used in the cosine term, bi denotes the box coordinates of vertex i, i ∈ [1, M], and j ∈ [1, N].
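A sketch of the top-K neighbor selection used to build the graphs in step (4-10), based on the cosine similarity between output embeddings and the IoU between box coordinates. How the two terms are combined into the clustering coefficient is not fully specified above, so the multiplicative combination and the value of K below are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def build_neighbors(embeddings, boxes, k=3):
    """embeddings: (V, d) output embeddings; boxes: (V, 4) xyxy box coordinates.
    Returns neighbor indices (V, k) and edge weights (V, k)."""
    sim = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
    iou = box_iou(boxes, boxes)                      # (V, V) pairwise IoU
    w = sim * iou                                    # assumed combination of the two terms
    w.fill_diagonal_(float("-inf"))                  # exclude each vertex from its own neighbors
    weights, idx = w.topk(k, dim=1)                  # top-K clustering coefficients per vertex
    return idx, weights
```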
Preferably, step (4-11) is specifically as follows: first, the feature matrix of each vertex in G1 and G2 obtained in step (4-10) is constructed; each vertex together with its neighbor nodes totals Num = K + 1 nodes, and the feature matrix Ft is obtained by concatenating the output embeddings of the vertex and its neighbor nodes, Ft ∈ R^(Num×d) being the feature matrix of that vertex, where d is the feature dimension of each node, equal to 256. Then, the two-dimensional adjacency matrix Mt ∈ R^(Num×Num) of each vertex in G1 and G2 with all of its neighbor nodes is obtained; with the target node indexed as 1, the two-dimensional adjacency matrix is initialized as Mt(i, j) = 1 if vertex i and vertex j are connected by an edge, and 0 otherwise,
where i, j ∈ {1, 2, ..., Num}, and Num denotes the total number of the target node and its neighbor nodes. Each vertex of G1 and G2 is then input into the graph convolutional network to obtain aggregated features, and finally the aggregated features are input into the feed-forward neural network to obtain the local-correlation-enhanced feature E1 of the previous frame and the local-correlation-enhanced feature E2 of the next frame.
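A sketch of one aggregation step of the graph convolutional network in step (4-11): for each vertex, the stacked features Ft of the vertex and its neighbors are aggregated with the adjacency matrix Mt and passed through a feed-forward layer. The degree normalization and the exact update rule are generic GCN choices, not necessarily those used in the patent.

```python
import torch
import torch.nn as nn

class LocalGCNLayer(nn.Module):
    """One aggregation step over a vertex and its Num-1 neighbors."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feat, adj):
        # feat: (V, Num, d) stacked vertex + neighbor embeddings (Ft per vertex)
        # adj:  (V, Num, Num) adjacency matrices (Mt), 1 where an edge exists
        deg = adj.sum(-1, keepdim=True).clamp(min=1)      # simple degree normalization
        agg = torch.bmm(adj, self.proj(feat)) / deg       # neighborhood aggregation
        enhanced = self.ffn(agg[:, 0])                    # keep the target vertex (index 0)
        return enhanced                                   # (V, d) local-correlation feature
```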
Preferably, step (4-12) is specifically as follows: first, before data association is performed with the association network, a feature matrix must be constructed; since the appearance feature of each target contains high-dimensional information, in order to make full use of this rich feature information, the present invention concatenates the local-correlation-enhanced feature E1 of the previous frame and the local-correlation-enhanced feature E2 of the next frame pairwise, obtaining the feature matrix E.
The feature matrix E is then fed into three different 1×1 convolutional layers and compressed to dimension MN × 2C by a reshape operation, yielding Q ∈ R^(MN×2C), K ∈ R^(MN×2C), and V ∈ R^(MN×2C), where C = 256. Matrix multiplication of Q and K gives the pairwise correlations between tracked targets; a softmax operation then yields E_{Q,K}, and E_{Q,K} is multiplied once more with V to obtain a further correlation matrix.
Finally, this matrix is fed into a feed-forward network layer and passed through two residual connections to obtain deeper correlation features between the tracking trajectories of the previous frame and the detected targets of the next frame, finally yielding the assignment matrix A ∈ R^(M×N).
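A sketch of the association network of step (4-12): pairwise-concatenated features are projected to Q, K, and V with 1×1 convolutions, a softmax attention map relates all track-detection pairs, and a feed-forward layer with residual connections produces the assignment matrix. The attention scaling, residual placement, and final scalar projection are assumptions, and the output is oriented here as tracks × detections.

```python
import torch
import torch.nn as nn

class AssociationNetwork(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Conv2d(2 * dim, 2 * dim, kernel_size=1)
        self.k = nn.Conv2d(2 * dim, 2 * dim, kernel_size=1)
        self.v = nn.Conv2d(2 * dim, 2 * dim, kernel_size=1)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, 2 * dim))
        self.score = nn.Linear(2 * dim, 1)     # assumed projection to a scalar per pair

    def forward(self, e1, e2):
        # e1: (N, dim) track features, e2: (M, dim) detection features
        N, M = e1.shape[0], e2.shape[0]
        pairs = torch.cat([e1.unsqueeze(1).expand(N, M, -1),
                           e2.unsqueeze(0).expand(N, M, -1)], dim=-1)   # (N, M, 2*dim)
        x = pairs.permute(2, 0, 1).unsqueeze(0)                         # (1, 2*dim, N, M)
        q = self.q(x).flatten(2).squeeze(0).t()                         # (N*M, 2*dim)
        k = self.k(x).flatten(2).squeeze(0).t()
        v = self.v(x).flatten(2).squeeze(0).t()
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)    # pairwise relations
        y = attn @ v + pairs.reshape(N * M, -1)                         # first residual
        y = self.ffn(y) + y                                             # second residual
        return self.score(y).view(N, M)                                 # assignment matrix
```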
Preferably, in step (4-13) the detection loss, the tracking loss, and the focal loss are obtained as follows: first, the newly detected targets of the next frame and the targets of the next frame associated with the previous frame are obtained; let Tt-1 and Tt denote the tracking trajectories of the previous frame and the detected targets of the next frame, respectively; any trajectory trk ∈ Tt \ Tt-1 corresponds to a new target in Tt that does not belong to Tt-1. The box coordinates and class labels of the newly detected targets of the next frame are input into the detection loss function to obtain the detection loss, and the box coordinates and class labels of the targets of the next frame associated with the previous frame are input into the tracking loss function to obtain the tracking loss. Subsequently, the assignment matrix between the tracking trajectories of the previous frame and the detected targets of the next frame obtained in step (4-12) is input into the focal loss function to obtain the focal loss.
The detection loss function is defined as follows:
L_det = λ_cls L_cls + λ_L1 L_box + λ_giou L_giou
where L_cls is the focal loss between the class labels of the newly detected targets and the ground-truth class labels of the training set, L_box and L_giou are the L1 distance and the generalized IoU between the normalized box coordinates and the ground-truth box coordinates of the training set, and λ_cls, λ_L1, and λ_giou are the weight parameters of the corresponding terms;
The tracking loss function L_trk is as follows:
L_trk = λ_cls L_cls + λ_L1 L_box + λ_giou L_giou
The focal loss function adds a modulation factor γ and a balance factor α to the standard cross-entropy loss, where γ ranges from 1 to 5 and is preferably 2, and α ranges from 0.1 to 0.7 and is preferably 0.5. This reduces the loss incurred by negative-sample classification so that the loss weight focuses more on the classification of positive samples; y = 1 denotes a successful match and y = 0 denotes an unmatched pair.
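A sketch of the focal loss described above, written in the standard form with modulation factor γ and balance factor α (γ = 2, α = 0.5 by default); treating the entries of the assignment matrix as match probabilities is an assumption of this sketch.

```python
import torch

def focal_loss(pred, target, gamma=2.0, alpha=0.5, eps=1e-8):
    """pred: predicted match probabilities in [0, 1] (e.g. a sigmoid of the
    assignment matrix); target: 1 for a matched pair, 0 otherwise."""
    pred = pred.clamp(eps, 1 - eps)
    pos = -alpha * (1 - pred) ** gamma * torch.log(pred)          # y = 1 terms
    neg = -(1 - alpha) * pred ** gamma * torch.log(1 - pred)      # y = 0 terms
    return (target * pos + (1 - target) * neg).mean()
```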
According to another aspect of the present invention, a multi-target tracking system based on a Transformer and graph embedding is provided, comprising:
a first module for acquiring a video sequence and reading the video sequence frame by frame to obtain all of its frames;
a second module for setting a counter cnt1 = 1;
a third module for judging whether cnt1 is equal to the total number of frames in the video sequence; if so, the process ends, otherwise the fourth module is entered;
a fourth module for inputting the cnt1-th frame and the (cnt1+1)-th frame of the video sequence into a pre-trained multi-target tracking model to obtain an assignment matrix between the targets in the cnt1-th frame and the corresponding targets in the (cnt1+1)-th frame, wherein the cnt1-th frame is the previous frame and the (cnt1+1)-th frame is the next frame;
a fifth module for obtaining, from the assignment matrix between the targets in the cnt1-th frame and the corresponding targets in the (cnt1+1)-th frame obtained by the fourth module, all targets in the (cnt1+1)-th frame associated with each target in the cnt1-th frame, and forming the tracking trajectory of that target from the target in the cnt1-th frame together with all the obtained targets;
a sixth module for setting the counter cnt1 = cnt1 + 1 and returning to the third module.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) Because the present invention adopts steps (4-7) to (4-11), the GCN is embedded into the integrated Transformer MOT architecture, which enhances the effectiveness of learning the spatio-temporal topology information of neighboring multiple targets in the encoding space; at the same time, an association network suited to linear-programming-style solving is constructed, remedying the lack of global inference for multi-target identity matching in the integrated Transformer architecture during testing. The proposed model combines the advantages of the Transformer and GCN models and uniformly embeds object detection, feature representation, and data association reasoning in MOT into a deep network model, which greatly exploits the advantages of multi-task joint optimization learning and improves model convergence. Therefore, the present invention can overcome the inherent defect of the existing Transformer model that global correlation learning not only increases computational complexity but also suffers from slow convergence caused by excessive redundant computation.
(2) Because the present invention adopts steps (4-10) to (4-11), specific group motion patterns among multiple targets in adjacent frames are exploited: target groups forming a neighborhood both exhibit consistent motion patterns and maintain fixed neighbor relations. This social-group characteristic reflects the local correlation between moving targets and, as prior information, can provide an effective basis for target detection and tracking in complex scenes. Therefore, the present invention can address the problems that the Transformer model learns without constraints, that its learning process lacks interpretability, and that there is no explicit priority for learning local neighborhood information.
(3) Because the present invention adopts step (4-12), inspired by the "attention is all you need" model of the Transformer architecture, an association network is proposed for the multi-target tracking data association problem, realizing a linear-programming-style network model based on a full attention mechanism, so that a neural-network-realizable form of data association is embedded into the integrated MOT learning model. This autonomous learning of the feature representation and metric suited to the data association optimization problem further improves the effectiveness of the model, and thus solves the problem that target information is distorted in difficult and complex scenes, leading to suboptimal association results.
(4) Because the association network in step (4-12) of the present invention can also be embedded into a neural network, the powerful learning capability of deep learning can be exploited; by strengthening the intrinsic connection between the data association model and the state inference model, the performance of both models is further improved;
(5) The present invention proposes a framework based on the Transformer model and graph embedding, which combines the advantages of both the Transformer and the GCN for the multi-target tracking problem; therefore, the parameters of the Transformer model can be solved automatically using the back-propagation algorithm, and the model can also be combined with the neural networks in steps (4-11) and (4-12) for end-to-end learning, obtaining more suitable parameters and further improving the performance of the entire model;
(6) The present invention has a wide range of applications: it can be used not only for pedestrian tracking, but also for trajectory tracking of moving targets of any known category.
Brief Description of the Drawings
Fig. 1 is a flowchart of the multi-target tracking method based on a Transformer and graph embedding of the present invention;
Fig. 2 shows the frames extracted in step (4) of the method of the present invention, where Fig. 2(a) is the previous frame and Fig. 2(b) is the next frame;
Fig. 3 is a schematic structural diagram of the Transformer encoder in step (4-7) of the method of the present invention;
Fig. 4 is a schematic structural diagram of the Transformer decoder in steps (4-8) and (4-9) of the method of the present invention;
Fig. 5 shows the targets in the frames extracted in steps (4-8) and (4-9) of the method of the present invention, where Fig. 5(a) shows the targets in the previous frame and Fig. 5(b) shows the targets in the next frame;
Fig. 6 is a schematic diagram of the graph construction in step (4-10) of the method of the present invention;
Fig. 7 is a schematic structural diagram of the GCN in step (4-11) of the method of the present invention;
Fig. 8 is a schematic structural diagram of the association network in step (4-12) of the method of the present invention.
Detailed Description of the Embodiments
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
The basic idea of the present invention is to propose an end-to-end integrated "Transformer + GCN" multi-target tracking framework suitable for multi-target tracking tasks. On the one hand, the GCN model is formalized into the Transformer decoder, and the vertex message-passing property of the GCN is used to explicitly guide the modeling of local correlations between neighboring target groups, thereby improving the effectiveness of the Transformer in learning the group characteristics of multiple targets in the scene. On the other hand, inspired by the "attention is all you need" model of the Transformer architecture, an association network is proposed for the multi-target tracking data association problem, realizing a linear-programming-style network model based on a full attention mechanism, so that a neural-network-realizable form of data association is embedded into the integrated MOT learning model. This autonomous learning of the feature representation and metric suited to the data association optimization problem further improves the effectiveness of the model.
As shown in Fig. 1, the present invention provides a multi-target tracking method based on a Transformer and graph embedding, comprising the following steps:
(1) acquiring a video sequence, and reading the video sequence frame by frame to obtain all of its frames;
Specifically, this step uses a video sequence obtained from the MOT17 dataset.
In this step, the length of the video sequence is 450 to 1500 frames.
(2) setting a counter cnt1 = 1;
(3) judging whether cnt1 is equal to the total number of frames in the video sequence; if so, the process ends, otherwise proceed to step (4);
(4) inputting the cnt1-th frame (the previous frame) and the (cnt1+1)-th frame (the next frame) of the video sequence into a pre-trained multi-target tracking model to obtain the assignment matrix between the targets in the cnt1-th frame and the corresponding targets in the (cnt1+1)-th frame;
For example, the assignment matrix obtained in this step is as follows, where the rows of the matrix correspond to the cnt1-th frame and the columns correspond to the (cnt1+1)-th frame:
0 1 0 0
1 0 0 0
0 0 0 1
0 0 1 0
As can be seen from the above assignment matrix, the first target in the cnt1-th frame is associated with the second target in the (cnt1+1)-th frame; the second target in the cnt1-th frame is associated with the first target in the (cnt1+1)-th frame; the third target in the cnt1-th frame is associated with the fourth target in the (cnt1+1)-th frame; and the fourth target in the cnt1-th frame is associated with the third target in the (cnt1+1)-th frame.
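For illustration only, the associations stated above can be read off the example assignment matrix with a row-wise argmax; the patent does not prescribe this particular decoding rule.

```python
import torch

A = torch.tensor([[0, 1, 0, 0],     # example 4 x 4 assignment matrix from above
                  [1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]])

matches = A.argmax(dim=1)           # index of the associated target in frame cnt1+1
for i, j in enumerate(matches):
    print(f"target {i + 1} in frame cnt1 -> target {int(j) + 1} in frame cnt1+1")
# target 1 -> target 2, target 2 -> target 1, target 3 -> target 4, target 4 -> target 3
```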
As shown in Fig. 1, the multi-target tracking model of the present invention comprises, connected in sequence, a baseline deep visual feature extraction network (a ResNet-50 network), a Transformer encoder, a Transformer decoder, a graph convolutional network (GCN), and an association network.
The first layer is the baseline deep visual feature extraction network, which comprises 1 convolutional layer, 16 building-block structures, and 1 fully connected layer. The input of the baseline deep visual feature extraction network is the previous frame and the next frame (RGB images with width W = 750, height H = 1333, and 3 channels), and its output is the baseline deep visual feature of dimension h × w × C corresponding to the previous frame and the baseline deep visual feature of dimension h × w × C corresponding to the next frame, where h is the height of the baseline deep visual feature, w is its width, and C is its number of channels, with C = 2048.
The second layer is the Transformer encoder. Its input is the baseline deep visual features of the previous frame and the next frame obtained by the baseline deep visual feature extraction network. After L layers of Transformer encoding, it outputs a 256-dimensional global feature corresponding to the previous frame and a 256-dimensional global feature corresponding to the next frame, where L ranges from 3 to 10 and is preferably 6.
The third layer is the Transformer decoder. For the previous frame, its input is the object queries and the global feature obtained by the Transformer encoder; N output embeddings (of dimension 256) are first obtained through the cross-attention mechanism, and then N box coordinates and class labels are obtained through a feed-forward neural network, as shown in Fig. 5(a). For the next frame, its input is the concatenation of the object queries with the N output embeddings obtained from the previous frame (serving as the track queries), together with the global feature obtained by the second layer; M output embeddings (of dimension 256) are first obtained through the cross-attention mechanism, and then M box coordinates and class labels are obtained through the feed-forward neural network, as shown in Fig. 5(b);
The fourth layer is the graph convolutional network (Fig. 7). Its input is the N output embeddings of the previous frame and the M output embeddings of the next frame obtained by the third layer. It first determines the neighbor relations to obtain the concatenated output embeddings of the previous frame and of the next frame, then iteratively processes them through the graph convolutional network, and outputs the local-correlation-enhanced features of the previous frame and of the next frame;
The fifth layer is the association network (Fig. 8). Its input is the local-correlation-enhanced features of the previous frame and of the next frame obtained by the fourth layer; it performs feature-based association processing on the two to obtain the assignment matrix between the corresponding targets of the previous frame and the next frame.
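A sketch of the first layer (the baseline deep visual feature extraction network) described above. The patent specifies 1 convolutional layer, 16 building blocks, and 1 fully connected layer, i.e. a ResNet-50; the sketch uses the standard torchvision ResNet-50 truncated before its pooling and classification head as a stand-in, so the exact weights and preprocessing are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])  # drop avgpool + fc

frame = torch.randn(1, 3, 1333, 750)        # one RGB frame, H = 1333, W = 750
with torch.no_grad():
    feat = backbone(frame)                  # (1, 2048, h, w) baseline visual features
print(feat.shape)                           # channels C = 2048, h = H/32, w = W/32
```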
The multi-target tracking model of the present invention is trained through the following steps:
(4-1) acquiring the MOT17 dataset and the CrowdHuman image database, dividing the acquired MOT17 dataset equally into a first half and a second half, using the first half of the MOT17 dataset together with the entire CrowdHuman image database as the training set, and using the second half of the MOT17 dataset as the test set.
It should be noted that in this step, the main annotated targets of the CrowdHuman image database are moving pedestrians and vehicles. The MOT17 dataset contains 14 video sequences in total, of which 7 video sequences with annotation information form the training set and the other 7 video sequences form the test set. The 7 test video sequences come from 7 different scenes with different shooting viewpoints and camera motions; the test set has a total length of 5919 frames and contains 188076 detection responses and 785 trajectories. The CrowdHuman image database is a pedestrian detection database for crowded environments; it contains 15,000 images with a total of 450K targets.
In addition, the MOT17 dataset is relatively small. In deep learning, a small amount of training data causes the convolutional neural network to learn more one-sided features, so the resulting model generalizes poorly and is prone to overfitting. In order to avoid changing the characteristics and appearance of the pedestrian images, the present invention expands the dataset through data augmentation: specifically, another part of the data comes from the CrowdHuman image database, and pseudo adjacent frames are obtained artificially by augmenting a single image through random scaling and translation. This effectively reduces the generalization error of the model and increases its robustness.
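A minimal sketch of the augmentation described above, in which a single CrowdHuman image is randomly scaled and translated to fabricate a pseudo "adjacent frame"; the parameter ranges and the use of torchvision's affine transform are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as TF

def make_pseudo_pair(image, max_shift=0.05, scale_range=(0.95, 1.05)):
    """image: PIL image or (C, H, W) tensor. Returns (prev_frame, next_frame),
    where the second frame is a randomly scaled and translated copy of the first."""
    w, h = TF.get_image_size(image)
    scale = random.uniform(*scale_range)
    dx = int(random.uniform(-max_shift, max_shift) * w)
    dy = int(random.uniform(-max_shift, max_shift) * h)
    next_frame = TF.affine(image, angle=0.0, translate=[dx, dy],
                           scale=scale, shear=[0.0])
    return image, next_frame
```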
Furthermore, the training set in the present invention is used to adjust parameters such as the trainable weights and biases of the multi-target tracking model, while the test set is used to adjust hyperparameters such as the learning rate of the multi-target tracking model; the test set does not participate in model training and is used to statistically evaluate the final prediction performance of the multi-target tracking model.
The advantage of this step is that more video scene information can be simulated, so that more latent information about the targets in the videos can be mined.
(4-2) Set a counter cnt3 = 1 (used as a pointer to the video sequences in the whole training set);
(4-3) Set a counter cnt2 = 1 (used as a pointer to the different frames within a video sequence), and judge whether cnt3 equals the total number of video sequences in the training set; if so, go to step (4-16), otherwise go to step (4-4);
(4-4) Judge whether cnt2 equals the total number of frames in the cnt3-th video sequence of the training set; if so, go to step (4-15), otherwise go to step (4-5);
(4-5) Initialize Nobj learnable object queries as the object queries of the previous frame and of the next frame, respectively, where Nobj = 500;
Specifically, the number of object queries is much larger than the total number of targets in a frame.
(4-6) Take the cnt2-th frame of the cnt3-th video sequence in the training set as the previous frame and the (cnt2+1)-th frame as the next frame, and feed the previous frame and the next frame separately into the baseline deep visual feature extraction network, to obtain a 2048-dimensional baseline deep visual feature I1 for the previous frame and a 2048-dimensional baseline deep visual feature I2 for the next frame.
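As an illustrative sketch of this feature-extraction step, assuming a PyTorch/torchvision implementation and the ResNet-50 backbone named later in the training setup; the input resolution and preprocessing are assumptions:

```python
import torch
import torchvision

# ResNet-50 pre-trained on ImageNet, with the pooling and classification
# head removed so it returns a 2048-channel feature map.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])
backbone.eval()

with torch.no_grad():
    prev_frame = torch.rand(1, 3, 800, 1333)   # previous frame (normalized)
    next_frame = torch.rand(1, 3, 800, 1333)   # next frame (normalized)
    I1 = backbone(prev_frame)  # (1, 2048, H/32, W/32) features of frame t
    I2 = backbone(next_frame)  # (1, 2048, H/32, W/32) features of frame t+1
```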
(4-7) For the previous frame and the next frame of the cnt3-th video sequence in the training set obtained in step (4-6), take the baseline deep visual feature I1 of the previous frame and the baseline deep visual feature I2 of the next frame obtained in step (4-6) as the Query, Key and Value of the Transformer encoder, and feed them into the multi-head self-attention layer of the Transformer encoder to obtain 256-dimensional feature vectors; apply layer normalization to the 256-dimensional feature vectors to obtain normalized feature vectors; and feed the normalized features into the feed-forward neural network of the Transformer encoder followed by normalization, to obtain the 512-dimensional global feature F1 of the previous frame and the 512-dimensional global feature F2 of the next frame.
Furthermore, Figure 3 shows the structure of the Transformer encoder used in this step. The network consists of L Transformer layers, and each Transformer layer is composed of three parts: a multi-head self-attention layer, a feed-forward neural network and a layer normalization layer.
The advantage of this step is that the Transformer encoder can capture global information about the targets of the entire input frame. Exploiting this global information helps the model better extract the features and position information of the targets in the whole input frame.
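A minimal sketch of one such encoder layer, assuming a PyTorch implementation; the hidden sizes, head count and number of layers below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer as in Figure 3: multi-head self-attention,
    layer normalization, and a feed-forward network."""
    def __init__(self, d_model=256, n_heads=8, d_ffn=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn),
                                 nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Query, Key and Value all come from the (projected) frame feature.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

# Stack L identical layers to form the encoder.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
tokens = torch.rand(1, 100, 256)       # projected backbone features of one frame
global_feature = encoder(tokens)       # per-frame global feature F1 / F2
```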
(4-8) For the previous frame of the cnt3-th video sequence in the training set obtained in step (4-6), feed the object queries of the previous frame into the multi-head self-attention layer and the layer normalization layer to obtain 256-dimensional object queries; take the global feature F1 of the previous frame obtained in step (4-7) as the Key and Value of the Transformer decoder and the 256-dimensional object queries as the Query, and feed them into the multi-head cross-attention layer of the Transformer decoder followed by normalization, to obtain N output embeddings of the previous frame (dimension 256); feed the output embeddings into a feed-forward neural network followed by normalization to obtain 512-dimensional feature vectors; and feed the 512-dimensional feature vectors into a feed-forward neural network to obtain N tracking trajectories of the previous frame, each tracking trajectory comprising box coordinates and a class label;
As shown in Figure 5(a), the N output embeddings serve as the track queries of the next frame;
Furthermore, for the previous frame the Transformer decoder only accepts object queries as the query input, in which case the Transformer decoder performs the same function as a detector.
Moreover, Figure 4 shows the Transformer decoder structure used in this step. The network consists of L Transformer layers, and each Transformer layer is composed of four parts: a multi-head self-attention layer, a multi-head cross-attention layer, a feed-forward neural network and a layer normalization layer. The multi-head self-attention layer, the feed-forward neural network and the layer normalization layer have exactly the same structure as in the Transformer encoder; the multi-head cross-attention layer differs from the multi-head self-attention layer in the source of the input features F: in the Transformer decoder the input features come from the object queries or track queries and from the global features obtained in step (4-7).
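The following sketch illustrates how one decoder layer of this kind consumes object queries alone for the previous frame and concatenated track/object queries for the next frame, assuming a PyTorch implementation; all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer as in Figure 4: self-attention on the queries,
    cross-attention against the frame's global feature, then a feed-forward
    network, each followed by layer normalization."""
    def __init__(self, d_model=256, n_heads=8, d_ffn=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, queries, memory):
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.cross_attn(q, memory, memory)[0])
        return self.norms[2](q + self.ffn(q))

decoder_layer = DecoderLayer()
object_queries = torch.rand(1, 500, 256)    # N_obj learnable object queries
track_queries = torch.rand(1, 20, 256)      # output embeddings from frame t

# Previous frame: object queries only -> the decoder acts as a detector.
prev_embeddings = decoder_layer(object_queries, torch.rand(1, 100, 256))

# Next frame: track queries concatenated with object queries, as in step (4-9).
cascaded = torch.cat([track_queries, object_queries], dim=1)
next_embeddings = decoder_layer(cascaded, torch.rand(1, 100, 256))
```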
(4-9) For the next frame of the cnt3-th video sequence in the training set obtained in step (4-6), first concatenate the track queries and the object queries of the next frame and feed them into the multi-head self-attention layer and the layer normalization layer to obtain 256-dimensional cascaded queries; take the cascaded queries as the Query and the global feature F2 obtained in step (4-7) as the Key and Value, and feed them into the multi-head cross-attention layer of the Transformer decoder followed by normalization, to obtain M output embeddings of the next frame (dimension 256); feed the output embeddings into a feed-forward neural network followed by normalization to obtain 512-dimensional feature vectors; and feed the 512-dimensional feature vectors into a feed-forward neural network to obtain M detected targets of the next frame, each detected target comprising box coordinates and a class label, as shown in Figure 5(b);
Furthermore, in the next frame the tracking trajectories of Figure 5(a) serve as the track queries, while the object queries serve as the basis for annotating new targets, realizing the detection, calibration and feature expression of new targets. Track queries carry the identity information of the targets throughout the video sequence and match their continuously changing positions in an autoregressive fashion. To this end, a track query is initialized each time a new target appears; it comes from the output embedding of the previous frame.
The advantage of steps (4-7) to (4-9) is that the encoder and decoder of the Transformer are stacks of multiple identical modules, each with the same structure and parameters. This modular design makes the Transformer very flexible and easy to extend and modify: modules can be freely added or removed according to the requirements of the task, which makes the Transformer suitable for a wide range of processing tasks.
(4-10) Build graph models G1 and G2 for the N tracking trajectories obtained in step (4-8) and the M detected targets obtained in step (4-9), respectively;
The construction process of this step is shown in Figure 6. Specifically, first determine the vertex sets: the vertex set V1 = (v1, v2, …, vN) consists of all tracking trajectories in the previous frame, and the vertex set V2 = (v1, v2, …, vM) consists of all detected targets in the next frame; the vertex values (features) are the output embeddings of the tracking trajectories of the previous frame and of the detected targets of the next frame, respectively. Then determine the neighbor nodes of all vertices in V1 and V2 to form edges, and finally obtain the graph models G1 and G2.
Furthermore, the neighbor nodes of all vertices in V1 and V2 are determined as follows. First obtain the box coordinates and output embeddings of all vertices in V1; then compute, from the box coordinates, the Intersection over Union (IoU) between the boxes of each pair of vertices and, from the output embeddings, the cosine similarity between the features of each pair of vertices; for each vertex, select the K vertices (excluding itself) with the largest aggregation weight coefficients as its neighbor nodes, where K ranges from 1 to 3 and is preferably 2, so as to form edges, and use the aggregation weight coefficient as the weight of the edge between the vertex and its neighbor node. The neighbor nodes of all vertices in V2 are determined in the same way as for V1 and are not repeated here.
Moreover, the weight of the edge between two vertices is the aggregation weight coefficient (AWC), which expresses the correlation between the vertices and is defined from the feature similarity and the spatial-distance similarity of vertex i and vertex j, i.e., from the cosine similarity of their features and the IoU of their boxes. Here cos(·,·) denotes the cosine similarity between feature vectors, IoU(·,·) denotes the Intersection over Union (IoU) of the position coordinates of two targets, l denotes the GCN iteration layer, the feature of vertex i at the l-th GCN layer is written h_i^(l), b_i denotes the box coordinates of vertex i, i ∈ [1, M], and j ∈ [1, N].
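A minimal sketch of this neighbor selection, assuming a PyTorch implementation; since the exact AWC formula is not reproduced above, the sum of cosine similarity and IoU used below is only an assumed combination of the two terms:

```python
import torch
import torch.nn.functional as F

def box_iou(b1, b2):
    """IoU between two boxes given as [x1, y1, x2, y2] lists of floats."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-9)

def build_knn_graph(embeddings, boxes, k=2):
    """Score every vertex pair by combining feature cosine similarity with
    box IoU, then keep the top-K neighbors per vertex (K = 2 preferred)."""
    n = embeddings.size(0)
    cos = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
    iou = torch.tensor([[box_iou(boxes[i], boxes[j]) for j in range(n)] for i in range(n)])
    awc = cos + iou                      # assumed combination of the two terms
    awc.fill_diagonal_(float("-inf"))    # exclude self-loops
    neighbors = awc.topk(k, dim=1).indices
    return neighbors, awc

embeddings = torch.rand(5, 256)          # output embeddings of one frame
boxes = [[10, 10, 50, 80], [12, 11, 52, 82], [200, 40, 240, 120],
         [205, 42, 244, 118], [400, 10, 430, 90]]
nbrs, weights = build_knn_graph(embeddings, boxes, k=2)
```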
(4-11) Feed the graph models G1 and G2 of the previous and next frames obtained in step (4-10) into the graph convolutional network (shown in Figure 7); as the number of graph convolution iterations increases, the locally correlation-enhanced feature E1 of the previous frame and the locally correlation-enhanced feature E2 of the next frame are obtained.
Specifically, first construct the feature matrix of each vertex in G1 and G2 obtained in step (4-10). Since each vertex (target node) is connected to its K neighbor nodes, the total number of the vertex and its neighbors is Num = K + 1, and the feature matrix Ft is obtained by concatenating the output embeddings of the vertex and its neighbor nodes; Ft ∈ R^(Num×d) is the feature matrix of that vertex, where d is the feature dimension of each node, namely 256. Then obtain, for each vertex in G1 and G2, the two-dimensional adjacency matrix Mt ∈ R^(Num×Num) between the vertex and all of its neighbor nodes; with the target node assigned index 1, the two-dimensional adjacency matrix is initialized as Mt(i, j) = 1 if vertex i and vertex j are connected by an edge and Mt(i, j) = 0 otherwise.
Here i, j ∈ {1, 2, …, Num}, Mt = 1 indicates that vertex i and vertex j are connected by an edge, and Num is the total number of the target node and its neighbor nodes. Each vertex of G1 and G2 is then fed into the graph convolutional network to obtain aggregated features; every vertex-feature update also updates the neighbor relationship, i.e. the vertices within a neighborhood re-measure their neighbor correlation by similarity. Writing M̂t for the normalized adjacency matrix, one feature-information update of the GCN model is defined as Ft^(l+1) = MLP(M̂t·Ft^(l)), which expands, for each vertex i, to Ft,i^(l+1) = MLP(Σ_j M̂t(i, j)·Ft,j^(l)).
Here Ft,i^(l) is the feature matrix corresponding to vertex i at layer l, MLP is a multi-layer perceptron, and l is the number of GCN iteration layers. As shown in Figure 7, as the number of iterations increases, the nodes in the neighborhood that contribute consistently to the target node keep receiving positive feedback; this consistent contribution reflects the distinctive correlation of the neighbor nodes with the target node. Finally, the aggregated features are fed into a feed-forward neural network to obtain the locally correlation-enhanced feature E1 of the previous frame and the locally correlation-enhanced feature E2 of the next frame.
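A minimal sketch of one GCN iteration of this kind, assuming a PyTorch implementation with symmetric adjacency normalization; the hidden sizes and the example adjacency are assumptions:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Aggregate the features of a target node and its Num = K + 1 neighbors
    with a normalized adjacency matrix, then pass the result through an MLP."""
    def __init__(self, d=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, Ft, Mt):
        # Symmetric normalization of the adjacency matrix (with self-loops).
        A = Mt + torch.eye(Mt.size(0))
        d_inv_sqrt = A.sum(dim=1).clamp(min=1e-9).pow(-0.5)
        A_hat = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
        return self.mlp(A_hat @ Ft)       # one feature-information update

K = 2
Num = K + 1                               # target node plus its K neighbors
Ft = torch.rand(Num, 256)                 # concatenated output embeddings
Mt = torch.tensor([[0., 1., 1.],          # target node (index 0) linked to
                   [1., 0., 0.],          # its two neighbor nodes
                   [1., 0., 0.]])

gcn = nn.ModuleList([GCNLayer() for _ in range(3)])   # 3 GCN layers
for layer in gcn:
    Ft = layer(Ft, Mt)                    # iterated features yield E1 / E2
```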
The advantage of steps (4-10) to (4-11) is that the GCN model is formalized into the Transformer decoder; exploiting the message-passing property of graph vertices, it explicitly guides the modeling of local correlations among neighboring target groups, thereby improving the effectiveness of learning the characteristics of multi-target groups in the scene.
(4-12) Feed the locally correlation-enhanced feature E1 of the previous frame and the locally correlation-enhanced feature E2 of the next frame obtained in step (4-11) into the association network (shown in Figure 8) and perform feature-level association processing on them, to obtain the assignment matrix between the tracking trajectories of the previous frame and the detected targets of the next frame;
Specifically, before the association network performs data association, a feature matrix must be constructed. Since the appearance feature of each target contains high-dimensional information, in order to make full use of this rich feature information the present invention concatenates the locally correlation-enhanced feature E1 of the previous frame and the locally correlation-enhanced feature E2 of the next frame pairwise, obtaining the feature matrix E.
The feature matrix E is then fed into three different 1×1 convolutional layers and compressed to dimension MN×2C by a reshape operation, yielding Q ∈ R^(MN×2C), K ∈ R^(MN×2C) and V ∈ R^(MN×2C), where C = 256. Matrix multiplication of Q and K gives the pairwise correlations between tracked targets; after a Softmax operation this yields E_{Q,K}, and one further matrix multiplication of E_{Q,K} with V gives a refined correlation matrix.
Finally, this matrix is fed into a feed-forward network layer and passed through two residual connections to obtain deeper correlation features between the tracking trajectories of the previous frame and the detected targets of the next frame, finally yielding the assignment matrix A ∈ R^(M×N).
Furthermore, if the tracking trajectories of the previous frame are [R1, R2, …, Rn] and the detected targets of the next frame are [r1, r2, …, rm], where n is the total number of tracking trajectories in the previous frame and m is the total number of detected targets in the next frame, then the assignment matrix constructed in this step is A = [P_yz] with y = 1, …, n and z = 1, …, m, where the element P_yz denotes the association confidence between the y-th tracking trajectory of the previous frame and the z-th detected target of the next frame; in this example y ∈ [1, n] and z ∈ [1, m].
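For illustration, a minimal sketch of such an association network, assuming a PyTorch implementation; C = 256 follows the text, while the exact layer arrangement, the scaled softmax and the residual placement are assumptions:

```python
import torch
import torch.nn as nn

class AssociationNetwork(nn.Module):
    """Pairwise concatenation of the enhanced features, three 1x1 convolutions
    producing Q, K, V, a softmax attention step, and a feed-forward head that
    outputs the assignment matrix."""
    def __init__(self, c=256):
        super().__init__()
        self.q_conv = nn.Conv2d(2 * c, 2 * c, kernel_size=1)
        self.k_conv = nn.Conv2d(2 * c, 2 * c, kernel_size=1)
        self.v_conv = nn.Conv2d(2 * c, 2 * c, kernel_size=1)
        self.ffn = nn.Sequential(nn.Linear(2 * c, 2 * c), nn.ReLU(),
                                 nn.Linear(2 * c, 1))

    def forward(self, E1, E2):
        n, m, c = E1.size(0), E2.size(0), E1.size(1)
        # Pairwise concatenation -> feature matrix E of shape (n, m, 2C).
        E = torch.cat([E1.unsqueeze(1).expand(n, m, c),
                       E2.unsqueeze(0).expand(n, m, c)], dim=-1)
        x = E.permute(2, 0, 1).unsqueeze(0)            # (1, 2C, n, m)
        Q = self.q_conv(x).flatten(2).squeeze(0).t()   # (n*m, 2C)
        K = self.k_conv(x).flatten(2).squeeze(0).t()
        V = self.v_conv(x).flatten(2).squeeze(0).t()
        attn = torch.softmax(Q @ K.t() / (2 * c) ** 0.5, dim=-1)
        refined = attn @ V                             # refined correlation
        scores = self.ffn(refined + Q)                 # residual connection
        return scores.view(n, m)                       # assignment matrix A

E1, E2 = torch.rand(8, 256), torch.rand(10, 256)
A = AssociationNetwork()(E1, E2)                       # (8, 10) confidences
```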
The advantage of this step is that, by implementing a linear-programming network model based on the self-attention mechanism, data association is formalized into a form realizable by a neural network and embedded into the integrated MOT learning model; this self-learning scheme is suited to the optimization and measurement of feature-expression-based data association, further improving the effectiveness of the model.
(4-13) Train the Transformer encoder, the Transformer decoder, the graph convolutional network and the association network using the N box coordinates and class labels obtained in step (4-8) and the M box coordinates and class labels obtained in step (4-9). For the previous frame and the next frame of the training set obtained in step (4-6), the Transformer encoder, the Transformer decoder and the GCN produce the box coordinates, class labels and locally correlation-enhanced features of the tracking trajectories of the previous frame and of the detected targets of the next frame; the association network produces the assignment matrix between the tracking trajectories of the previous frame and the detected targets of the next frame; the obtained box coordinates, class labels and assignment matrix are then fed into the defined loss functions to obtain the detection loss, the tracking loss and the focal loss;
Specifically, the detection loss, the tracking loss and the focal loss are obtained as follows. First determine the newly detected targets of the next frame and the targets of the next frame that are associated with the previous frame. Let T_{t-1} and T_t denote the tracking trajectories of the previous frame and the detected targets of the next frame, respectively. When a trajectory trk ∈ T_t \ T_{t-1}, i.e. it corresponds to a new target in T_t that does not belong to T_{t-1}, the present invention regards trk as a newly detected target of the next frame, and the detection loss is used; when a trajectory trk ∈ T_t ∩ T_{t-1}, i.e. a target shared by the previous and next frames, the present invention regards trk as a target of the next frame associated with the previous frame. The box coordinates and class labels of the newly detected targets of the next frame are fed into the detection loss function to obtain the detection loss, and the box coordinates and class labels of the targets of the next frame associated with the previous frame are fed into the tracking loss function to obtain the tracking loss. Then the assignment matrix between the tracking trajectories of the previous frame and the detected targets of the next frame obtained in step (4-12) is fed into the focal loss function to obtain the focal loss.
Furthermore, the detection loss function is defined as follows:
Ldet = λcls·Lcls + λL1·Lbox + λgiou·Lgiou
where Lcls is the focal loss between the class labels of the newly detected targets and the ground-truth class labels of the training set, Lbox and Lgiou are the L1 distance and the generalized IoU between the normalized box coordinates and the ground-truth box coordinates of the training set, and λcls, λL1 and λgiou are the weight parameters of the respective terms, set to 2, 2 and 3. Similarly, the tracking loss function Ltrk is:
Ltrk = λcls·Lcls + λL1·Lbox + λgiou·Lgiou
When the association network produces the assignment matrix, the successfully matched targets in the matrix are far fewer than the unmatched ones, so a large number of negative samples arise; using a cross-entropy loss would prevent the loss from converging because of this imbalance between positive and negative samples. The present invention therefore adopts the focal loss as the loss function for computing the matching relationship in the association network. The focal loss is an improved version of the cross-entropy loss designed specifically to address the imbalance between positive and negative samples, and it is defined as Lfocal = −α·(1 − p)^γ·log(p) when y = 1 and Lfocal = −(1 − α)·p^γ·log(p) when y = 0, where p is the predicted matching confidence.
The focal loss adds a modulation factor γ and a balance factor α on top of the cross-entropy loss; γ ranges from 1 to 5 and is preferably 2, and α ranges from 0.1 to 0.7 and is preferably 0.5. This reduces the loss contributed by the classification of negative samples, so that the loss weight focuses more on the classification of positive samples; y = 1 denotes a successful match and y = 0 denotes no match.
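A minimal sketch of this focal loss, assuming a PyTorch implementation and the preferred values γ = 2 and α = 0.5; applying it element-wise to a sigmoid-activated assignment matrix is an assumption:

```python
import torch
import torch.nn.functional as F

def focal_loss(pred_logits, target, alpha=0.5, gamma=2.0):
    """Focal loss over the assignment matrix: `target` holds 1 for a matched
    pair (y = 1) and 0 otherwise (y = 0)."""
    p = torch.sigmoid(pred_logits)
    ce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    # Modulating factor down-weights easy (mostly negative) examples.
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

A = torch.randn(8, 10)                   # predicted assignment confidences
gt = torch.zeros(8, 10)                  # ground-truth assignment matrix
gt[0, 3] = gt[2, 5] = 1.0
loss = focal_loss(A, gt)
```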
Moreover, the Transformer network training parameters are set as follows: the model uses ResNet-50 pre-trained on ImageNet as the CNN backbone and the Encoder and Decoder architecture of Deformable-DETR; starting from a model pre-trained on CrowdHuman, it is jointly fine-tuned on MOT17 and CrowdHuman. The model adopts the training strategy of TrackFormer and is fine-tuned directly on MOT17 and CrowdHuman with the AdamW optimizer, chosen for its momentum and adaptive learning rate; the initial learning rate is 2×10^-4 and the batch size is 2. The graph convolutional network training parameters are set as follows: the algorithm is trained from scratch, the number of graph convolution layers is 3, the model weights are initialized with the ImageNet-pretrained ResNet-50 model, the optimizer is SGD, and the initial learning rate is 1×10^-3, reduced to 1×10^-4 after 40 epochs. The association network training parameters are set as follows: the association network is trained following the method created by DeepMOT, in which the ground-truth labels of the dataset consist of assignment matrices; an assignment matrix is a two-dimensional matrix of 0s and 1s representing the true matching results between labeled targets, so this dataset is suitable for training the data-association process of the association network designed by the present invention. The initial learning rate is set to 0.005 and decreased by a factor of 10 every 10 epochs, and the optimizer is SGD;
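For illustration, a sketch of the optimizer and learning-rate settings described above, assuming a PyTorch implementation; the placeholder modules below merely stand in for the networks built earlier:

```python
import torch

transformer = torch.nn.Linear(256, 256)     # stands in for encoder + decoder
gcn = torch.nn.Linear(256, 256)             # stands in for the 3-layer GCN
assoc = torch.nn.Linear(512, 1)             # stands in for the association net

opt_transformer = torch.optim.AdamW(transformer.parameters(), lr=2e-4)
opt_gcn = torch.optim.SGD(gcn.parameters(), lr=1e-3)
opt_assoc = torch.optim.SGD(assoc.parameters(), lr=0.005)

# GCN: drop to 1e-4 after 40 epochs; association net: divide by 10 every 10 epochs.
sched_gcn = torch.optim.lr_scheduler.MultiStepLR(opt_gcn, milestones=[40], gamma=0.1)
sched_assoc = torch.optim.lr_scheduler.StepLR(opt_assoc, step_size=10, gamma=0.1)
```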
The advantage of this step is that target detection, feature expression and data-association reasoning in multi-target tracking are embedded into a single deep network model, fully exploiting the advantage of joint multi-task optimization learning.
(4-14) Set cnt2 = cnt2 + 1 and return to step (4-4);
(4-15) Set cnt3 = cnt3 + 1 and return to step (4-3);
(4-16) Using the detection loss, the tracking loss and the focal loss obtained in step (4-13), iteratively train the multi-target tracking model by back-propagation until the multi-target tracking model converges, thereby obtaining a preliminarily trained multi-target tracking model.
(4-17) Verify the preliminarily trained multi-target tracking model of step (4-16) with the test set obtained in step (4-1) until the obtained tracking accuracy reaches its optimum, thereby obtaining the trained multi-target tracking model.
(5) According to the assignment matrix between the targets in the cnt1-th frame and the corresponding targets in the (cnt1+1)-th frame obtained in step (4), obtain all targets in the (cnt1+1)-th frame associated with each target in the cnt1-th frame, and form the tracking trajectory of that target from the target in the cnt1-th frame together with all the obtained targets;
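A minimal sketch of this linking step, assuming a PyTorch implementation; the greedy row-wise argmax and the confidence threshold are assumptions, since the decoding rule is not fixed above:

```python
import torch

def link_targets(assignment, threshold=0.5):
    """Read the assignment matrix between frame cnt1 and frame cnt1+1 and link
    each frame-cnt1 target to its most confident frame-(cnt1+1) target."""
    links = {}
    for y in range(assignment.size(0)):          # targets in frame cnt1
        z = int(torch.argmax(assignment[y]))     # best match in frame cnt1+1
        if assignment[y, z] >= threshold:
            links[y] = z                         # extend track y with target z
    return links

A = torch.rand(8, 10)                            # assignment matrix from step (4)
print(link_targets(A))
```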
(6) Set the counter cnt1 = cnt1 + 1 and return to step (3).
In summary, the present invention proposes a multi-target tracking algorithm based on a Transformer and graph embedding. On the one hand, the GCN model is formalized into the Transformer decoder; exploiting the message-passing property of GCN vertices, it explicitly guides the modeling of local correlations among neighboring target groups, thereby improving the effectiveness of the Transformer in learning the characteristics of multi-target groups in the scene. On the other hand, inspired by the "attention is all you need" model of the Transformer architecture, an association network is proposed for the data-association problem of multi-target tracking, implementing a linear-programming network model based on a full attention mechanism; data association is thereby formalized into a form realizable by a neural network and embedded into the integrated MOT learning model, and this self-learning scheme is suited to the feature expression and measurement of data-association optimization problems, further improving the effectiveness of the model. Therefore, the present invention can effectively reflect the correlation of real data in the multi-target tracking process, and the tracking results are highly accurate.
Experimental results
The practical effect of the present invention is illustrated here by the test results on the MOT17 test set. The tracking results of the proposed multi-target tracking algorithm on the MOT17 dataset are evaluated with the following standard metrics: Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), ID F1 Score (IDF1), False Positives (FP), False Negatives (FN), Higher Order Tracking Accuracy (HOTA), Association Accuracy (AssA), Detection Accuracy (DetA) and ID Switches (IDS). "↑" means higher is better and "↓" means lower is better. Table 1 below gives a detailed comparison of the test results of the present invention with the existing high-performing TransCenter, TransTrack and TrackFormer algorithms on the MOT17 test set.
Table 1
From Table 1 it can be seen that: (1) the method of the present invention ranks first on HOTA and also surpasses the other three algorithms on the IDF1 and AssA metrics. HOTA is the main metric for evaluating the overall performance of an algorithm; compared with the other three trackers, the proposed tracking algorithm achieves the best score, which shows that its overall performance is superior to that of the other three algorithms;
(2) The higher IDF1 and higher AssA indicate that the proposed method, which uses the advantage of propagating neighborhood properties through the graph model to guide the Transformer to explicitly mine the local context information of the scene, can significantly improve the Re-ID performance in MOT and thereby improve tracking performance.
Those skilled in the art will readily understand that the above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement and improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.