CN114529578A - Multi-target tracking method based on a contrastive learning approach - Google Patents
Multi-target tracking method based on a contrastive learning approach
- Publication number: CN114529578A
- Application number: CN202210071761.0A
- Authority: CN (China)
- Prior art keywords: module, correlation, features, ReID, feature
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 7/246 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06F 18/2415 — Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
- G06N 3/045 — Neural network architectures; combinations of networks
- G06N 3/08 — Neural network learning methods
- G06T 2207/20081 — Indexing scheme for image analysis; training/learning
- G06T 2207/30241 — Indexing scheme for image analysis; trajectory
Abstract
The invention discloses a multi-target tracking method based on a contrastive learning approach, comprising the following steps: extracting basic features of an input image through a backbone structure; fusing the low-level structural features and deep semantic features of the backbone through a neck structure; and, in the head structure, decoupling the received shared features into two independent feature representations through a cross-correlation module, with the detection module and the ReID module receiving their corresponding features from the cross-correlation module in parallel. Because the detection module and the ReID module are treated as tasks of the same level, separate training and a second inference pass for the ReID module are avoided; the cross-correlation module learns independent feature representations for the detection module and the ReID module, alleviating the competition and granularity problems between the two modules; and contrastive learning is used as the supervision signal of the ReID module, so the number of instance classes does not need to be counted, which improves learning efficiency.
Description
Technical Field
The invention relates to the technical field of target tracking, in particular to a multi-target tracking method based on a contrastive learning approach.
Background
The multi-target tracking technology relates to the field of artificial intelligence, in particular to the computer vision and deep learning technology, and is widely applied to the fields of intelligent video monitoring, intelligent security, intelligent parks, intelligent cities, automatic driving and the like.
One widely studied and deployed approach to multi-target tracking is detection-based tracking, which first obtains the positioning frame of each target in the input image through detection and then performs tracking based on those positioning frames. Detection-based multi-target tracking can be described as a matching problem between the target positioning frames of consecutive frames: the goal is to put the positioning frames of objects in the previous frame into one-to-one correspondence with those in the following frame, and to extend this correspondence over the whole video stream to form the tracking trajectories of multiple objects in the video.
The general procedure for matching the target positioning frames of two consecutive frames is: determine the target positioning frame information in both frames with a detection module; extract features of the region framed by each target's positioning frame with a ReID module; compute a similarity measure on the extracted features with a chosen feature metric (e.g., Euclidean distance, Mahalanobis distance, cosine distance); match the target positioning frames of the two frames with a matching algorithm (e.g., a greedy algorithm or the Hungarian algorithm) according to the similarity measure; the optimal matching result is the final tracking result.
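As a rough illustration of this matching step, the sketch below assumes cosine distance as the feature metric and the Hungarian algorithm (via scipy.optimize.linear_sum_assignment) as the matching algorithm; the function and variable names are hypothetical and not taken from the patent.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_cost(prev_feats: np.ndarray, curr_feats: np.ndarray) -> np.ndarray:
    """Cost matrix of cosine distances between L2-normalized ReID features."""
    prev = prev_feats / np.linalg.norm(prev_feats, axis=1, keepdims=True)
    curr = curr_feats / np.linalg.norm(curr_feats, axis=1, keepdims=True)
    return 1.0 - prev @ curr.T  # shape: (num_prev, num_curr)

def match_boxes(prev_feats, curr_feats, max_dist=0.4):
    """Hungarian matching between previous-frame and current-frame targets."""
    cost = cosine_cost(prev_feats, curr_feats)
    rows, cols = linear_sum_assignment(cost)
    # keep only matches whose distance is below the gating threshold
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]

# usage: 3 targets in the previous frame, 4 detections in the current frame
prev = np.random.randn(3, 256).astype(np.float32)
curr = np.random.randn(4, 256).astype(np.float32)
print(match_boxes(prev, curr))
```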
The existing multi-target tracking method has the following problems:
(1) In most current multi-target tracking methods, the detection module and the ReID (re-identification) module are executed sequentially. The ReID module has to receive the target positioning frames determined by the detection module, and the number of times it is run is proportional to the number of target positioning frames, which increases the computational complexity.
(2) Some current multi-target tracking methods process the detection module and the ReID module in a one-shot manner and learn the required information from the same feature level. Although this alleviates the computational complexity problem, in practice the two modules are in competition, which is detrimental to network learning.
(3) Current multi-target tracking methods use a classification objective as the supervision signal of the ReID module, so the number of instance classes in the data set is strongly coupled with the final tracking performance. Specifically, the number of instance classes directly affects the quality of the re-identification features produced by the ReID module, which in turn affects the accuracy of the subsequent similarity matching and hence the final tracking performance.
For example, Chinese patent CN109522843B, published on July 2, 2021, discloses a multi-target tracking method and apparatus, device and storage medium, wherein the method comprises: determining a pedestrian detection frame of a target to be tracked in a video to be processed; determining pedestrian posture information and pedestrian re-identification features of the target to be tracked according to the pedestrian detection frame; determining a similarity matrix for two adjacent frames of the video according to the pedestrian posture information, the pedestrian re-identification features and the pedestrian detection frame; and tracking the target according to the similarity matrix to obtain the tracking result. In that method the detection module and the ReID module are independent of each other, which leads to high computational complexity and low computational efficiency.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in current multi-target tracking methods, the detection module and the ReID module are independent of each other, which leads to high computational complexity and low computational efficiency. The proposed multi-target tracking method based on a contrastive learning approach avoids separate training and a second inference pass for the ReID module, reducing computational complexity and improving computational efficiency.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows: a multi-target tracking method based on a contrastive learning approach, comprising the following steps:
extracting basic features of the input image through a backbone structure;
fusing the low-level structural features and deep semantic features of the backbone structure through the neck structure, and outputting the obtained multi-scale feature information to the head structure as shared features;
the head structure decouples the received shared features into two independent feature representations through a cross-correlation module, a detection module and a ReID module receive corresponding features from the cross-correlation module in parallel, a target positioning frame is obtained through the detection module, and re-identification features of a framing area are obtained through the ReID module;
and calculating and matching similarity measurement through a post-processing module to complete target tracking of the image.
The head structure comprises a cross-correlation module, a detection module and a ReID module. Because the detection module and the ReID module receive features from the cross-correlation module in parallel, the target positioning frames in the input image and the re-identification features of the framed regions can be obtained with a single forward pass; a post-processing module such as the DeepSORT algorithm can then be applied to compute and match the similarity measure and complete the tracking of two consecutive frames.
Preferably, the process of decoupling the shared features by the cross-correlation module comprises: for the independent characteristics of the detection module and the ReID module, learning independence information in a self-correlation manner; for the characteristics common to the detection module and the ReID module, learning common information in a cross-correlation manner; and fusing the self-correlation weights and the cross-correlation weights to compute the feature representation of each task module. The common and independent feature representations of the detection module and the ReID module are learned by the cross-correlation module. For independence information learning, self-correlation learns the interrelationships between different feature channels and reconstructs them to enhance the feature representation of each independent module. For common information learning, the information shared between the two modules is learned through a cross-correlation mechanism.
Preferably, the process of computing the self-correlation weight of each task comprises:
obtaining the multi-scale shared feature $F \in \mathbb{R}^{C\times H\times W}$ through the neck structure, where $C$, $H$ and $W$ denote the number of channels, the height and the width of the feature, respectively;
obtaining a smaller-scale feature $F' \in \mathbb{R}^{C\times H'\times W'}$ by average pooling;
feeding it into a convolutional layer to obtain the features $T_1$ and $T_2$ for the different task modules;
reshaping them into $\{M_1, M_2\} \in \mathbb{R}^{C\times N'}$, where $N' = H' \times W'$;
multiplying each task's matrix with its own transpose and applying softmax to compute the self-correlation weight of each task:
$$W_i^{self} = \mathrm{softmax}\big(M_i M_i^{\mathsf T}\big), \quad i \in \{1, 2\}.$$
The self-correlation weight of each task is thus computed through average pooling (avgpool), reshaping (reshape) and transposition (transpose) operations.
Preferably, the process of computing the cross-correlation weight comprises:
obtaining the multi-scale shared feature $F \in \mathbb{R}^{C\times H\times W}$ through the neck structure, where $C$, $H$ and $W$ denote the number of channels, the height and the width of the feature, respectively;
obtaining a smaller-scale feature $F' \in \mathbb{R}^{C\times H'\times W'}$ by average pooling;
feeding it into a convolutional layer to obtain the features $T_1$ and $T_2$ for the different task modules;
reshaping them into $\{M_1, M_2\} \in \mathbb{R}^{C\times N'}$, where $N' = H' \times W'$;
performing matrix multiplication between $M_1$ and the transpose of $M_2$ and applying a softmax layer to generate the cross-correlation weight:
$$W^{cross} = \mathrm{softmax}\big(M_1 M_2^{\mathsf T}\big),$$
where $W^{cross}_{ij}$ represents the effect of the $i$-th channel of one task module on the $j$-th channel of the other.
The self-correlation and cross-correlation weights are obtained by the above formulas.
Preferably, the process of computing the feature representation of each task module comprises:
fusing the self-correlation and cross-correlation weights through a trainable parameter γ to obtain $\{W_1, W_2\} \in \mathbb{R}^{C\times C}$;
reshaping the multi-scale shared feature $F$ into $\mathbb{R}^{C\times N}$, where $N = H \times W$;
performing matrix multiplication between the reshaped feature and the learned weights $\{W_1, W_2\} \in \mathbb{R}^{C\times C}$ to obtain an enhanced representation of each task module; and fusing the enhanced representation with the multi-scale shared feature $F$ through a residual calculation and reshaping the result to obtain the final feature representation of each task module.
By decoupling the multi-scale shared features of the head structure through the cross-correlation module, the corresponding feature representations of the detection module and the ReID module are obtained separately, which alleviates the competition and granularity problems of the two modules.
Preferably, the detection module comprises a classification branch and a regression branch; the category of the current positioning frame is determined through the classification branch, and the position of the target is determined through the regression branch. The ReID module comprises a re-identification feature branch, through which the re-identification features of the region framed by the target positioning frame are obtained. The detection module is thus divided into two branches: the classification branch, which determines the category of the current positioning frame, and the regression branch, which determines the location of the target, such as the coordinates of the top-left and bottom-right points of the frame. The ReID module acquires the re-identification features of the region framed by the target positioning frame and directly learns the feature distribution of the sample data using the contrastive learning idea.
Preferably, the process of computing the loss function of the re-identification feature branch comprises:
taking one image as the key frame, and randomly selecting one image from the neighbouring frames of the key frame as the reference frame;
determining the positive and negative samples of the two frames using the intersection-over-union between the anchor boxes and the corresponding labels, the re-identification features of the positive and negative samples being the re-identification features extracted by the ReID module;
computing the loss function $L_{embed}$ of the re-identification feature branch:
$$L_{embed} = \log\Big[1 + \sum_{K^+}\sum_{K^-} \exp\big(V \cdot K^- - V \cdot K^+\big)\Big],$$
where $V$ is a key-frame training sample, $K$ a reference-frame matching sample, $K^+$ the re-identification feature of a positive sample and $K^-$ the re-identification feature of a negative sample.
Each frame contains positive and negative samples; a match between positive samples of the two frames that come from the same target is positive, otherwise it is negative.
Preferably, the loss function of the whole multi-target tracking network is computed as:
$$L_{total} = L_{det} + \gamma L_{embed},$$
where $L_{total}$ is the loss of the whole network, $L_{det}$ the loss of the detection module, $L_{embed}$ the loss of the ReID module, and $\gamma$ the weight balance coefficient between the two modules. The weight balance coefficient is typically set to 0.025.
Preferably, the classification branch, the regression branch and the re-identification feature branch each consist of a shared first part and a branch-specific second part. The first part comprises a convolution block composed of 4 convolutional layers. The second part of the classification branch comprises a convolutional layer with kernel size 3, stride 1 and C channels; the second part of the regression branch comprises a convolutional layer with kernel size 3, stride 1 and D channels; and the second part of the re-identification feature branch comprises a convolutional layer with kernel size 3, stride 1 and E channels.
Here C denotes the number of target categories to be identified, D the coordinates of the top-left and bottom-right points of the positioning frame, and E the dimension of the re-identification feature. The first part is a convolution block composed of 4 convolutional layers with kernel size 3, stride 1 and 256 channels, structured as BN-conv-ReLU-BN-conv-ReLU-BN-conv-ReLU-BN-conv-ReLU, where BN, conv and ReLU denote batch normalization, a convolutional layer and a rectified linear unit, respectively.
The substantial effects of the invention include:
(1) The proposed multi-target tracking method treats the detection module and the ReID module as tasks of the same level and directly extracts the positioning frame information of the targets to be identified and the corresponding re-identification features from the input image, avoiding separate training and a second inference pass for the ReID module.
(2) In the proposed method, the cross-correlation module independently learns the respective feature representations of the detection module and the ReID module, alleviating the competition and granularity problems of the two modules.
(3) The proposed method uses contrastive learning as the supervision signal of the ReID module, so the number of instance classes does not need to be counted, which improves model learning efficiency.
Drawings
FIG. 1 is a schematic flow chart of the main steps of the present embodiment;
fig. 2 is a schematic diagram of a network topology according to the present embodiment;
FIG. 3 is a schematic diagram of a multi-target tracking network structure according to the present embodiment;
fig. 4 is a schematic structural diagram of the cross-correlation module in this embodiment.
Reference numerals: 1, input image; 2, shared features; 3, cross-correlation module; 4, feature representation; 5, detection module; 6, ReID module; 7, backbone structure; 8, neck structure; 9, head structure.
Detailed Description
The following description will further specifically explain embodiments of the present invention by referring to the accompanying drawings.
A multi-target tracking method based on a contrastive learning approach is disclosed; as shown in FIG. 1, it comprises the following steps:
extracting basic features of the input image 1 through the backbone structure 7;
fusing the low-level structural features and deep semantic features of the backbone structure 7 through the neck structure 8, and outputting the obtained multi-scale feature information to the head structure 9 as the shared features 2;
the head structure 9 comprises the cross-correlation module 3, the detection module 5 and the ReID module 6; the head structure 9 decouples the received shared features 2 into two independent feature representations 4 through the cross-correlation module 3, the detection module 5 and the ReID module 6 receive their corresponding features from the cross-correlation module 3 in parallel, the detection module 5 obtains the target positioning frames, and the ReID module 6 obtains the re-identification features of the framed regions; because the detection module 5 and the ReID module 6 receive the features from the cross-correlation module 3 in parallel, the target positioning frames in the input image 1 and the re-identification features of the framed regions can be obtained with a single forward pass;
computing and matching the similarity measure through a post-processing module, such as the DeepSORT algorithm, to complete the target tracking of the image across two consecutive frames.
Existing multi-target tracking methods follow the detect-then-track idea: first, the positioning frame information of each target is determined by the detection module 5; second, the re-identification features of each target's positioning frame region are extracted by the ReID module 6; then a similarity measure is computed for each re-identification feature, matching of targets is completed with a matching algorithm, and the same track ID is assigned to the same target.
(1) In existing multi-target tracking methods, the detection module 5 and the ReID module 6 are two independent tasks that have to be trained separately, which increases the complexity of the tracking system; being invoked multiple times in the test stage also increases the computational complexity.
The re-identification feature idea in multi-target tracking comes from pedestrian re-identification (ReID) technology. For multi-target tracking the input data is the whole image, whereas for pedestrian re-identification the input data is the target to be identified in the image, i.e. a sub-image of the whole image. Because the two input data levels differ, the two modules have to be trained separately; at test time, the detection module 5 must first extract the positioning frame of each target to be identified from the input image 1, and then the region of each target, i.e. the sub-image, is sent to the ReID module 6 to obtain the re-identification features. The ReID module 6 has to receive the target positioning frames determined by the detection module 5, and the number of times it is run is proportional to the number of target positioning frames, which increases the computational complexity.
In the proposed multi-target tracking method, the ReID module 6 and the detection module 5 are treated as tasks of the same level, and the positioning frame information of the targets to be identified and the corresponding re-identification features are extracted directly from the input image 1, which avoids separate training and the second inference pass at test time, reduces computational complexity, and improves the computational efficiency of the multi-target tracking network.
(2) In conventional one-shot multi-target tracking methods, the detection module 5 and the ReID module 6 learn the required information from the same feature level, but in practice the two modules are in competition, which increases the cost of network learning.
The detection module 5 and the ReID module 6 belong to two different tasks but share one feature representation 4 in one-shot multi-target tracking methods, which increases the learning cost. Specifically, a one-shot multi-target tracking method extracts the target category confidence and target position information required by the detection module 5 and the target re-identification feature information required by the ReID module 6 from a single feature representation 4, ignoring the differences between the two. The features learned by the two modules may be ambiguous or inaccurate, or may be tuned only in pursuit of one task, degrading the performance of the other task. On the other hand, the two tasks have different information granularities: the detection module 5 needs the feature information of different objects of the same class to have similar semantics, while the ReID module 6 needs to learn the distinguishing semantics between two objects. Current multi-target tracking methods, however, do not distinguish the characteristics of the two.
The proposed multi-target tracking method first uses a cross-correlation module to decouple the shared features of the detection module 5 and the ReID module 6 into two independent feature branches so as to learn their respective feature representations 4 independently, as shown in FIG. 2. These two independent feature representations 4 are then related by self-correlation and cross-correlation using a self-attention mechanism. Self-correlation facilitates the learning of each independent module, and cross-correlation facilitates learning that benefits both. In this way, the cross-correlation module 3 alleviates the competition and granularity problems of the two modules.
(3) Existing multi-target tracking methods directly supervise the learning of the ReID module 6 with a classification objective, which reduces network learning efficiency.
Classification means assigning a sample to a certain category, describing the sample by a predetermined category or instance. Since the categories or instances are determined in advance, the total number of categories or instances over all samples in the data set is explicitly known. For a practically usable multi-target tracking data set, the number of categories or instances may range from tens of thousands to hundreds of thousands, and as the image data grows, the number of categories or instances grows as well. For a network, the greater the number of categories or instances, the harder it is to learn, and it may even become impossible to learn.
The proposed multi-target tracking method uses contrastive learning as the supervision signal of the ReID module 6 and does not need to count the number of categories or instances, which improves model learning efficiency. The core idea of contrastive learning is to pull positive samples closer together and push positive samples away from negative samples, thereby directly learning the distribution of the samples rather than the category to which each sample belongs. The ultimate goal is that, for the sample set, the model can distinguish which samples are positive and which are negative.
The structural schematic of the multi-target tracking method based on a contrastive learning approach is shown in FIG. 3. It mainly comprises three parts, the backbone structure 7, the neck structure 8 and the head structure 9, with arrows indicating the direction of information flow. The backbone structure 7 (e.g. ResNet50) extracts the basic features of the input image 1; the neck structure 8 (e.g. an FPN) fuses the low-level structural features and deep semantic features of the backbone structure 7 to obtain multi-scale feature information, and outputs features of different scales as the shared features 2 of the head structure 9; the head structure 9 is designed for the specific tasks and comprises the cross-correlation module 3, the detection module 5 and the ReID module 6; the cross-correlation module 3 receives the shared features 2 of different scales from the neck structure 8 and decouples them into two independent feature representations 4, which are input to the detection module 5 and the ReID module 6, respectively, to avoid the competition and granularity problems between them.
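As a rough illustration of this pipeline, the PyTorch-style sketch below wires a backbone, a neck and a head comprising a cross-correlation module and two parallel task modules; all class and method names are hypothetical placeholders rather than an implementation taken from the patent, and the sub-modules are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

class TrackerNetwork(nn.Module):
    """Backbone -> neck -> (cross-correlation -> detection / ReID) sketch."""

    def __init__(self, backbone, neck, cross_corr, det_head, reid_head):
        super().__init__()
        self.backbone = backbone      # e.g. a ResNet-50 feature extractor
        self.neck = neck              # e.g. an FPN producing multi-scale features
        self.cross_corr = cross_corr  # decouples shared features into two representations
        self.det_head = det_head      # classification + regression branches
        self.reid_head = reid_head    # re-identification feature branch

    def forward(self, image: torch.Tensor):
        base_feats = self.backbone(image)              # basic features of the input image
        shared = self.neck(base_feats)                 # multi-scale shared features
        det_feat, reid_feat = self.cross_corr(shared)  # two independent representations
        cls_out, reg_out = self.det_head(det_feat)     # boxes obtained in one forward pass
        embeddings = self.reid_head(reid_feat)         # re-ID features, in parallel
        return cls_out, reg_out, embeddings
```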
As can be seen from fig. 3, the detection module 5 and the ReID module 6 receive the features from the cross-correlation module 3 in parallel, and only one forward inference is needed to obtain the re-identification features of the target location box and the location box framing area in the input image 1.
The detection module 5 is divided into two branches: a classification branch, which determines the category of the current positioning frame, and a regression branch, which determines the location of the target, such as the coordinates of the top-left and bottom-right points of the frame. The ReID module 6 obtains the re-identification features of the region framed by the target positioning frame and directly learns the feature distribution of the sample data using the contrastive learning idea. A post-processing module such as the DeepSORT algorithm can then be applied to compute and match the similarity measure and complete the tracking of two consecutive frames. The key points of the invention are the parallel arrangement of the ReID module 6 and the detection module 5, the feature decoupling by the cross-correlation module 3, and the learning of the sample distribution with the contrastive learning idea, so the DeepSORT algorithm is not described in detail here.
In order to improve the compactness of the multi-target tracking network structure and the consistency of training and testing, the ReID module 6 is directly embedded into the head structure 9 of the multi-target tracking network structure, and receives the features from the previous level in parallel with the detection module 5, as shown in fig. 3.
Specifically, in the head structure 9, the classification, regression and re-identification feature branches each consist of two parts. The first part is a convolution block composed of 4 convolutional layers with kernel size 3, stride 1 and 256 channels, structured as BN-conv-ReLU-BN-conv-ReLU-BN-conv-ReLU-BN-conv-ReLU, where BN, conv and ReLU denote batch normalization, a convolutional layer and a rectified linear unit, respectively. The second part is a convolutional layer designed for the specific branch: for the classification branch, a convolutional layer with kernel size 3, stride 1 and C channels, where C denotes the number of target categories to be identified; for the regression branch, a convolutional layer with kernel size 3, stride 1 and 4 channels, the 4 channels representing the coordinates of the top-left and bottom-right points of the positioning frame; and for the re-identification feature branch, a convolutional layer with kernel size 3, stride 1 and 256 channels, 256 being the dimension of the re-identification feature.
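A minimal sketch of one such branch is given below, assuming the layer ordering described above (BN-conv-ReLU repeated four times, followed by a branch-specific 3×3 convolution); the function name and channel counts chosen for the usage example are illustrative only.

```python
import torch.nn as nn

def make_branch(in_channels: int, out_channels: int) -> nn.Sequential:
    """Shared 4-layer conv block (BN-conv-ReLU x4) plus a branch-specific conv."""
    layers = []
    c = in_channels
    for _ in range(4):
        layers += [
            nn.BatchNorm2d(c),
            nn.Conv2d(c, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        ]
        c = 256
    # branch-specific head: C classes, 4 box coordinates, or a 256-dim embedding
    layers.append(nn.Conv2d(256, out_channels, kernel_size=3, stride=1, padding=1))
    return nn.Sequential(*layers)

cls_branch = make_branch(256, out_channels=80)     # e.g. 80 target categories
reg_branch = make_branch(256, out_channels=4)      # top-left and bottom-right coordinates
embed_branch = make_branch(256, out_channels=256)  # re-identification feature dimension
```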
In order to avoid the competition and granularity problems of the detection module 5 and the ReID module 6 in one-shot multi-target tracking methods and to improve the ability to learn independent feature representations, the invention proposes a novel cross-correlation module to effectively decouple the shared features 2.
The common and independent feature representations 4 of the detection module 5 and the ReID module 6 are learned by the cross-correlation module. For independence information learning, self-correlation learns the interrelations between different feature channels and reconstructs them to enhance the feature representation 4 of each independent module. For common information learning, the information shared between the two modules is learned through a cross-correlation mechanism.
The detailed structure of the cross-correlation module is shown in FIG. 4. Formally, assume the multi-scale shared feature 2 obtained after the neck structure 8 is $F \in \mathbb{R}^{C\times H\times W}$, where $C$, $H$ and $W$ denote the number of channels, the height and the width of the feature, respectively. An average pooling (avgpool) operation is first applied to obtain a smaller-scale feature $F' \in \mathbb{R}^{C\times H'\times W'}$.
$F'$ is then fed into a convolutional layer to obtain the features $T_1$ and $T_2$ for the different task modules;
they are then reshaped into the form $\{M_1, M_2\} \in \mathbb{R}^{C\times N'}$, where $N' = H' \times W'$.
Each task's matrix is then multiplied with its own transpose, and softmax is applied to compute the self-correlation weight of each task:
$$W_i^{self} = \mathrm{softmax}\big(M_i M_i^{\mathsf T}\big), \quad i \in \{1, 2\}.$$
Matrix multiplication is also performed between $M_1$ and the transpose of $M_2$ to learn the commonality between the different tasks, and a softmax layer is then applied to generate the cross-correlation weight:
$$W^{cross} = \mathrm{softmax}\big(M_1 M_2^{\mathsf T}\big),$$
where $W^{cross}_{ij}$ represents the effect of the $i$-th channel of one task module on the $j$-th channel of the other.
The self-correlation and cross-correlation weights are fused through a trainable parameter γ to obtain $\{W_1, W_2\} \in \mathbb{R}^{C\times C}$.
The multi-scale shared feature 2, $F$, is reshaped into $\mathbb{R}^{C\times N}$, where $N = H \times W$.
Matrix multiplication is then performed between the reshaped feature and the learned weights $\{W_1, W_2\} \in \mathbb{R}^{C\times C}$ to obtain an enhanced representation of each task module, and the enhanced representation is fused with the multi-scale shared feature 2, $F$, through a residual calculation and reshaped again to obtain the final feature representation of each task module.
Thus, by decoupling the multi-scale shared features 2 of the head structure 9 through the cross-correlation module, the feature representations 4 of the detection module 5 and the ReID module 6 are obtained separately, alleviating the competition and granularity problems of the two modules.
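The sketch below illustrates one way this cross-correlation module could be realized in PyTorch, following the operations just described (average pooling, per-task convolutions, self- and cross-correlation weights, fusion by a trainable γ, and residual reconstruction); it is an assumed reading of the module, not the patent's reference implementation, and all names, the 1×1 projection convolutions and the pooled size are illustrative choices. In a forward pass, det_feat would feed the classification and regression branches and reid_feat the re-identification branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCorrelationModule(nn.Module):
    """Decouples a shared feature map into detection and ReID representations."""

    def __init__(self, channels: int, pooled_size: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)      # F -> F' (smaller scale)
        self.proj_det = nn.Conv2d(channels, channels, 1)   # T1 for the detection task
        self.proj_reid = nn.Conv2d(channels, channels, 1)  # T2 for the ReID task
        self.gamma = nn.Parameter(torch.zeros(1))          # trainable fusion weight

    @staticmethod
    def _flatten(t: torch.Tensor) -> torch.Tensor:
        b, c, h, w = t.shape
        return t.reshape(b, c, h * w)                      # M in R^{C x N'}

    def forward(self, feat: torch.Tensor):
        b, c, h, w = feat.shape
        pooled = self.pool(feat)
        m1 = self._flatten(self.proj_det(pooled))          # (B, C, N')
        m2 = self._flatten(self.proj_reid(pooled))
        # self-correlation weights: softmax(M_i M_i^T), shape (B, C, C)
        w1_self = F.softmax(torch.bmm(m1, m1.transpose(1, 2)), dim=-1)
        w2_self = F.softmax(torch.bmm(m2, m2.transpose(1, 2)), dim=-1)
        # cross-correlation weight: softmax(M_1 M_2^T)
        w_cross = F.softmax(torch.bmm(m1, m2.transpose(1, 2)), dim=-1)
        # fuse self- and cross-correlation weights with the trainable gamma
        w1 = w1_self + self.gamma * w_cross
        w2 = w2_self + self.gamma * w_cross.transpose(1, 2)
        flat = feat.reshape(b, c, h * w)                   # F in R^{C x N}
        det_feat = torch.bmm(w1, flat).reshape(b, c, h, w) + feat   # residual fusion
        reid_feat = torch.bmm(w2, flat).reshape(b, c, h, w) + feat
        return det_feat, reid_feat
```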
The feature representations 4 of the detection module 5 and the ReID module 6 are thus obtained through the cross-correlation module.
The feature representation 4 for detection is then input to the classification branch and the regression branch, and the feature representation 4 for re-identification is input to the re-identification feature branch; according to the structures described above, the classification, regression and re-identification results $P_{cls}$, $P_{reg}$ and $P_{embed}$ are obtained, respectively.
Loss functions are then constructed with the labels $G_{cls}$, $G_{reg}$ and $G_{embed}$ of the corresponding branches, and the network weights are updated through the back-propagation algorithm to train the network.
For the classification and regression branches, as in common detection tasks, the loss functions are Focal Loss and GIoU loss, respectively.
For the re-identification feature branch, the invention directly learns the feature distribution of the sample data using the contrastive learning idea and proposes a new loss function $L_{embed}$, described below.
As shown in FIG. 3, one image is taken as the key frame, and one image is randomly selected from the neighbouring frames of the key frame as the reference frame. The intersection-over-union (IoU) between the anchor boxes and the corresponding labels is used to determine the positive and negative samples of the two frames; the re-identification features of the positive and negative samples are the re-identification features extracted by the ReID module 6. Each frame contains positive and negative samples, and a match between positive samples of the two frames that come from the same target is positive, otherwise it is negative.
Assume the key frame has V samples for training and the reference frame has K samples for target matching. For each training sample, the matching result is optimized with a cross-entropy function:
$$L_{embed} = -\log \frac{\exp(V \cdot K^+)}{\exp(V \cdot K^+) + \sum_{K^-} \exp(V \cdot K^-)},$$
where $V$, $K^+$ and $K^-$ are the re-identification features of the training sample, a positive sample and a negative sample, respectively.
The overall loss is the average over all training samples; the formula above gives the loss of a single sample with a single positive match. Because every sample of the key frame must be matched against the samples of the reference frame, a training sample of the key frame does not necessarily have only one positive target in the reference frame, so the formula becomes
$$L_{embed} = -\sum_{K^+}\log \frac{\exp(V \cdot K^+)}{\exp(V \cdot K^+) + \sum_{K^-} \exp(V \cdot K^-)}.$$
However, this formula considers each negative sample many times while each positive sample is considered only once, so it is further rewritten as
$$L_{embed} = \log\Big[1 + \sum_{K^-} \exp\big(V \cdot K^- - V \cdot K^+\big)\Big],$$
which, in the case of multiple positive samples, can be expanded to
$$L_{embed} = \log\Big[1 + \sum_{K^+}\sum_{K^-} \exp\big(V \cdot K^- - V \cdot K^+\big)\Big].$$
Therefore, the whole multi-target tracking network can be optimized with a multi-task loss function:
$$L_{total} = L_{det} + \gamma L_{embed},$$
where $L_{total}$ is the loss of the whole network, $L_{det}$ the loss of the detection module 5, $L_{embed}$ the loss of the ReID module 6, and $\gamma$ the weight balance coefficient between the two modules, set to 0.025.
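The following sketch shows how this embedding loss and the combined multi-task loss could be computed for one key-frame sample, under the assumption that the multi-positive form above is used; the tensor names and the detection-loss value in the usage example are placeholders.

```python
import torch

def embed_loss(v: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """Contrastive embedding loss for one training sample.

    v:   (D,)   re-ID feature of the key-frame sample
    pos: (P, D) re-ID features of positive reference-frame samples
    neg: (N, D) re-ID features of negative reference-frame samples
    """
    pos_sim = pos @ v  # (P,) dot products V . K+
    neg_sim = neg @ v  # (N,) dot products V . K-
    # log[1 + sum_{K+} sum_{K-} exp(V.K-  -  V.K+)]
    diffs = neg_sim.unsqueeze(0) - pos_sim.unsqueeze(1)  # (P, N)
    return torch.log1p(torch.exp(diffs).sum())

def total_loss(det_loss: torch.Tensor, emb_loss: torch.Tensor, gamma: float = 0.025):
    """Multi-task loss: L_total = L_det + gamma * L_embed."""
    return det_loss + gamma * emb_loss

# usage with random 256-dim embeddings, 2 positives and 5 negatives
v = torch.randn(256)
loss = embed_loss(v, torch.randn(2, 256), torch.randn(5, 256))
print(total_loss(torch.tensor(1.3), loss))
```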
ReID, i.e. pedestrian re-identification (also called person re-identification in this document), is a technique that uses computer vision to determine whether a specific pedestrian is present in an image or video sequence. It is widely regarded as a sub-problem of image retrieval: given an image of a monitored pedestrian, the same pedestrian is retrieved across devices. It overcomes the visual limitation of fixed cameras, can be combined with pedestrian detection and tracking technologies, and is widely applicable to intelligent video surveillance, intelligent security and similar fields.
The above examples only show some embodiments of the present invention and are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention.
Claims (9)
1. A multi-target tracking method based on a contrastive learning approach, characterized by comprising the following steps:
extracting basic features of the input image (1) through a backbone structure (7);
fusing low-level structural features and deep semantic features of the backbone structure (7) through the neck structure (8), and outputting the obtained multi-scale feature information to the head structure (9) as shared features (2);
the head structure (9) decouples the received shared features (2) into two independent feature representations (4) through the cross-correlation module (3), the detection module (5) and the ReID module (6) receive corresponding features from the cross-correlation module (3) in parallel, a target positioning frame is obtained through the detection module (5), and re-identification features of a framed area are obtained through the ReID module (6);
and calculating and matching similarity measurement through a post-processing module to complete target tracking of the image.
2. The multi-target tracking method based on a contrastive learning approach according to claim 1, characterized in that the process of decoupling the shared features (2) by the cross-correlation module (3) comprises: for the independent characteristics of the detection module (5) and the ReID module (6), learning independence information in a self-correlation manner; for the characteristics common to the detection module (5) and the ReID module (6), learning common information in a cross-correlation manner; and fusing the self-correlation weights and cross-correlation weights to compute the feature representation (4) of each task module.
3. The multi-target tracking method based on a contrastive learning approach according to claim 2, characterized in that the process of computing the self-correlation weight of each task comprises:
obtaining the multi-scale shared feature (2) $F \in \mathbb{R}^{C\times H\times W}$ through the neck structure (8), where $C$, $H$ and $W$ denote the number of channels, the height and the width of the feature, respectively;
obtaining a smaller-scale feature $F' \in \mathbb{R}^{C\times H'\times W'}$ by average pooling;
feeding it into a convolutional layer to obtain the features $T_1$ and $T_2$ for the different task modules;
reshaping them into $\{M_1, M_2\} \in \mathbb{R}^{C\times N'}$, where $N' = H' \times W'$;
multiplying each task's matrix with its own transpose and applying softmax to compute the self-correlation weight of each task:
$$W_i^{self} = \mathrm{softmax}\big(M_i M_i^{\mathsf T}\big), \quad i \in \{1, 2\}.$$
4. The multi-target tracking method based on a contrastive learning approach according to claim 2 or 3, characterized in that the process of computing the cross-correlation weight comprises:
obtaining the multi-scale shared feature (2) $F \in \mathbb{R}^{C\times H\times W}$ through the neck structure (8), where $C$, $H$ and $W$ denote the number of channels, the height and the width of the feature, respectively;
obtaining a smaller-scale feature $F' \in \mathbb{R}^{C\times H'\times W'}$ by average pooling;
feeding it into a convolutional layer to obtain the features $T_1$ and $T_2$ for the different task modules;
reshaping them into $\{M_1, M_2\} \in \mathbb{R}^{C\times N'}$, where $N' = H' \times W'$;
performing matrix multiplication between $M_1$ and the transpose of $M_2$ and applying a softmax layer to generate the cross-correlation weight:
$$W^{cross} = \mathrm{softmax}\big(M_1 M_2^{\mathsf T}\big).$$
5. The multi-target tracking method based on a contrastive learning approach according to claim 2 or 3, characterized in that the process of computing the feature representation (4) of each task module comprises:
fusing the self-correlation and cross-correlation weights through a trainable parameter γ to obtain $\{W_1, W_2\} \in \mathbb{R}^{C\times C}$;
reshaping the multi-scale shared feature (2) $F$ into $\mathbb{R}^{C\times N}$, where $N = H \times W$;
performing matrix multiplication between the reshaped feature and the learned weights $\{W_1, W_2\} \in \mathbb{R}^{C\times C}$ to obtain an enhanced representation of each task module; and fusing the enhanced representation with the multi-scale shared feature (2) $F$ through a residual calculation and reshaping to obtain the final feature representation (4) of each task module.
6. The multi-target tracking method based on a contrastive learning approach, characterized in that the detection module (5) comprises a classification branch and a regression branch, the category of the current positioning frame being determined through the classification branch and the position of the target through the regression branch; the ReID module (6) comprises a re-identification feature branch, through which the re-identification features of the region framed by the target positioning frame are obtained.
7. The multi-target tracking method based on a contrastive learning approach according to claim 6, characterized in that the process of computing the loss function of the re-identification feature branch comprises:
taking one image as the key frame, and randomly selecting one image from the neighbouring frames of the key frame as the reference frame;
determining the positive and negative samples of the two frames using the intersection-over-union between the anchor boxes and the corresponding labels, the re-identification features of the positive and negative samples being the re-identification features extracted by the ReID module (6);
computing the loss function $L_{embed}$ of the re-identification feature branch:
$$L_{embed} = \log\Big[1 + \sum_{K^+}\sum_{K^-} \exp\big(V \cdot K^- - V \cdot K^+\big)\Big],$$
where $V$ is a key-frame training sample, $K$ a reference-frame matching sample, $K^+$ the re-identification feature of a positive sample and $K^-$ the re-identification feature of a negative sample.
8. The multi-target tracking method based on a contrastive learning approach according to claim 7, characterized in that the loss function of the whole multi-target tracking network is computed as:
$$L_{total} = L_{det} + \gamma L_{embed},$$
where $L_{total}$ is the loss of the whole network, $L_{det}$ the loss of the detection module (5), $L_{embed}$ the loss of the ReID module (6), and $\gamma$ the weight balance coefficient between the two modules.
9. The multi-target tracking method based on a contrastive learning approach according to claim 6, characterized in that the classification branch, the regression branch and the re-identification feature branch each comprise a first part and a second part that differ from one another; the first part comprises a convolution block composed of 4 convolutional layers; the second part of the classification branch comprises a convolutional layer with kernel size 3, stride 1 and C channels; the second part of the regression branch comprises a convolutional layer with kernel size 3, stride 1 and D channels; and the second part of the re-identification feature branch comprises a convolutional layer with kernel size 3, stride 1 and E channels.
Priority Applications (1)
- CN202210071761.0A — priority date 2022-01-21 — filing date 2022-01-21 — CN114529578A (en): Multi-target tracking method based on a contrastive learning approach

Applications Claiming Priority (1)
- CN202210071761.0A — priority date 2022-01-21 — filing date 2022-01-21 — CN114529578A (en): Multi-target tracking method based on a contrastive learning approach

Publications (1)
- CN114529578A — publication date 2022-05-24

Family
- ID=81620730

Family Applications (1)
- CN202210071761.0A — CN114529578A (en) — priority date 2022-01-21 — filing date 2022-01-21

Country Status (1)
- CN — CN114529578A (en)
Cited By (1)
- CN117934451A — priority date 2024-03-13 — publication date 2024-04-26 — assignee: 中国水利水电第一工程局有限公司 — Unmanned aerial vehicle inspection method and system applied to photovoltaic power station

Events
- 2022-01-21 — Application CN202210071761.0A filed; patent CN114529578A (en) — status: Pending
Non-Patent Citations (5)
- Liang C et al., "Rethinking the competition between detection and ReID in Multi-Object Tracking", arXiv, 19 November 2020, pp. 1-5.
- Pang J et al., "Quasi-Dense Similarity Learning for Multiple Object Tracking", arXiv, 8 September 2021, pp. 1-16.
- Park Y et al., "Multiple object tracking in deep learning approaches: A survey", Electronics, vol. 10, no. 19, 2 October 2021, pp. 1-31.
- Wang Z et al., "Towards Real-Time Multi-Object Tracking", arXiv, 14 July 2020, pp. 1-17.
- 王宇龙, PyTorch深度学习入门与实战, China Railway Publishing House, 30 September 2020, pp. 176-181.
Similar Documents
- CN111259786B — Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
- CN111709311B — Pedestrian re-identification method based on multi-scale convolution feature fusion
- CN112069940B — Cross-domain pedestrian re-identification method based on staged feature learning
- US2017/0124415A1 — Subcategory-aware convolutional neural networks for object detection
- CN110633632A — Weakly supervised joint target detection and semantic segmentation method based on loop guidance
- CN110414368A — Unsupervised pedestrian re-identification method based on knowledge distillation
- CN111161315B — Multi-target tracking method and system based on graph neural network
- US2008/0002856A1 — Tracking system with fused motion and object detection
- CN114972418A — Maneuvering multi-target tracking method combining kernel adaptive filtering and YOLOX detection
- CN110163117B — Pedestrian re-identification method based on self-excitation discriminative feature learning
- CN111950372A — Unsupervised pedestrian re-identification method based on graph convolution network
- CN112861695A — Pedestrian identity re-identification method and device, electronic equipment and storage medium
- Symeonidis et al. — Neural attention-driven non-maximum suppression for person detection
- Hammam et al. — Real-time multiple spatiotemporal action localization and prediction approach using deep learning
- CN112800967A — Pose-driven occluded pedestrian re-identification method
- CN113128410A — Weakly supervised pedestrian re-identification method based on trajectory association learning
- CN116452688A — Image description generation method based on a co-attention mechanism
- Abdullah et al. — Vehicle counting using deep learning models: a comparative study
- CN112613474B — Pedestrian re-identification method and device
- CN114529578A — Multi-target tracking method based on a contrastive learning approach (this document)
- CN116665036B — RGB-infrared multi-source image target detection method based on single-mode auxiliary supervision and YOLOv5
- CN109784244B — Accurate low-resolution face recognition method for a specified target
- CN110110598A — Pedestrian re-identification method and system based on visual features and spatio-temporal constraints
- CN112052722A — Pedestrian identity re-identification method and storage medium
- CN114972434B — Cascade detection and matching end-to-end multi-target tracking system
Legal Events
- PB01 — Publication
- SE01 — Entry into force of request for substantive examination