
CN117877068B - Occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels - Google Patents

Occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels

Info

Publication number
CN117877068B
Authority
CN
China
Prior art keywords
image
pedestrian
mask
network
image block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410016648.1A
Other languages
Chinese (zh)
Other versions
CN117877068A (en)
Inventor
李骜
邵春锐
谢委衡
杨海陆
程媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202410016648.1A priority Critical patent/CN117877068B/en
Publication of CN117877068A publication Critical patent/CN117877068A/en
Application granted granted Critical
Publication of CN117877068B publication Critical patent/CN117877068B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels, belonging to the field of pedestrian re-identification in multimedia information processing. The method comprises a mask-guided masked-autoencoder image completion model and an occluded pedestrian re-identification network based on dynamic graphs and graph convolution. First, the image completion model is trained in a self-supervised manner by randomly deleting image patches and regenerating the complete picture from the remaining patches, using a mean-squared-error loss to reduce the difference between the generated picture and the original. The occluded pedestrian re-identification network is then trained jointly with the triplet loss, the ID loss and the center loss to obtain robust and discriminative features. At test time, occluded images are completed by the image completion model under mask guidance, reconstructing the pedestrian body pixels hidden by obstacles. The completed pedestrian images are then fed into the occluded pedestrian re-identification network to obtain pedestrian features and perform re-identification. Compared with other methods, the proposed method significantly improves the accuracy of occluded pedestrian re-identification.

Description

A method for re-identification of occluded pedestrians based on masked self-supervised reconstruction of occluded pixels

Technical Field

The present invention belongs to the field of pedestrian re-identification in computer vision, and in particular relates to an occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels.

Background Art

Person re-identification is the task of retrieving a specific pedestrian captured by different cameras. It plays a vital role in surveillance systems and has therefore attracted widespread attention. However, challenges during data collection such as differences in camera position, low resolution, illumination changes and occlusion by obstacles make existing person re-identification methods insufficiently robust and keep recognition accuracy relatively low. Among these challenges, occlusion is one of the most difficult problems in current person re-identification.

At present, most deep-learning-based methods rely on human key-point or human-structure detection. Under occlusion, however, key points or body structures are often hidden by obstacles, so existing methods cannot identify them accurately. Moreover, similar obstacles cause similar content to appear in images of different pedestrians, which greatly reduces the discriminability of the features extracted by the neural network and further harms the accuracy and robustness of re-identification. A new pedestrian re-identification technique that effectively overcomes occlusion is therefore needed to improve recognition performance in complex scenes.

Summary of the Invention

To solve the above problems, the present invention provides an occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels. The method comprises the following steps:

Training data A, training data B and test data are collected from surveillance cameras. Training data A contains occluded pedestrian re-identification images together with instance-segmentation-level annotations and pedestrian IDs. Training data B contains pedestrian re-identification images with relatively complete human bodies. The test data is divided into query images and gallery images and contains only the original occluded pedestrian re-identification images and pedestrian IDs.

First, the image completion model is trained to convergence using training data A and training data B. A test image fed into the converged completion model has its occluded pedestrian pixels reconstructed, yielding a de-occluded pedestrian image. The occluded pedestrian re-identification network is then trained to convergence using training data B. Finally, the test data is passed through the image completion model to obtain de-occluded pedestrian images, which are fed into the re-identification network to achieve good recognition accuracy.

The mask-guided masked-autoencoder image completion model removes occlusion from occluded pedestrian images. Its training process is as follows:

(1) Using training data A, the existing instance segmentation network Mask2former is trained to convergence, yielding a network that outputs the human-body mask of a pedestrian image. The converged instance segmentation network predicts the pedestrian mask Mi corresponding to an image Xi of identity i: pixels of Mi at the positions of unoccluded pedestrian pixels in Xi are white, and all other pixels are black.

(2) The self-supervised mask-guided image modeling network reconstructs the occluded pedestrian pixels of an occluded pedestrian image. The occluded pedestrian image Xi in R^(H×W×3) and its corresponding mask Mi (H and W are the image height and width, and 3 is the number of RGB channels) are converted into image-patch embeddings by a patching function, where each patch has height Ph and width Pw. For an image Xi, the patching function convolves Xi with a kernel of size Ph×Pw and stride Ph×Pw, outputting C dimensions per patch. The flattened patch embeddings obtained from the image are:

XiP = Patch(Xi, θ)

MiP = Random(Mi)

Here Patch is the image patching function, C is the number of dimensions of each patch embedding, (Ph, Pw) is the resolution of each patch, the total number of patches is HW/(Ph·Pw), and θ is a learnable parameter. The Random function computes the number of patches from the size of the input mask image and randomly generates a corresponding number of pixel-retention scores. MiP holds the pixel-retention score of every patch of the current image Xi, ranging from 0 to 105; patches whose score is below 60 are marked as reconstruction blocks, and the corresponding patch embeddings are discarded and do not enter the encoder. During training, MiP is generated by the random function.

(3) The embeddings of the patches that are not discarded are added to the patch position encodings and fed into the pixel-reconstruction encoder. The position encoding uses the 2D formula:

PE(posX, 2i) = sin(posX / 10000^(2i/dmodel)),  PE(posY, 2i+1) = cos(posY / 10000^(2i/dmodel))

Here posX and posY are the horizontal and vertical coordinates of the patch in the original image, dmodel is the dimensionality of the position encoding (equal to the patch-embedding dimensionality C in this invention), and i takes integer values from 0 to 0.5C-1. After every term is computed, the values are arranged as PE(posX,0), PE(posY,1), PE(posX,2), PE(posY,3), ..., PE(posX,C-2), PE(posY,C-1) to form the 2D position encoding.

The retained patch embeddings are fed into the reconstruction encoder to obtain an intermediate tensor. A learnable tensor is inserted at each discarded-patch position of the intermediate tensor, giving the tensor to be learned. The learnable tensor has the same shape as the patch embedding it replaces, i.e. a vector of dimensionality C, the patch-embedding dimensionality. The tensor to be learned is fed into the decoder to obtain the reconstructed patch embeddings, which are then unpacked to obtain the reconstructed image.

During training, the self-supervised mask-guided image modeling network is trained with training data B and its mask images. Specifically, the black pixels of each pedestrian image's mask are overlaid on the corresponding pedestrian image, while white pixels are left untouched, producing training data B_withMask. The image modeling network is then trained in a self-supervised manner on B_withMask.

The self-supervised mask-guided image modeling network computes the loss only over patches whose pixel-retention score is below 60, using the mean squared error as the loss function.

Furthermore, during prediction the test data is fed into the converged instance segmentation network to obtain the test-image mask MiTest. The test data and MiTest are then fed together into the image modeling network to obtain de-occluded pedestrian images. During prediction the image modeling network operates as follows:

The pedestrian image and the test-image mask MiTest are fed into the patching function, which returns the flattened patch embeddings and the pixel-retention scores MiP. Patches whose score MiP is below 60 are discarded. The retained patch embeddings are added to the patch position encodings and fed into the pixel-reconstruction encoder, yielding an intermediate tensor. A learnable tensor is inserted at each discarded-patch position to obtain the tensor to be learned, which is fed into the decoder; the resulting reconstructed patch embeddings are unpacked to obtain the reconstructed image.

The only difference between prediction and training is how the pixel-retention scores MiP are obtained. During testing they are computed as:

MiP = Patch_formask(MiTest, 1)

MiP holds the pixel-retention score of every patch of the current image Xi; each score ranges from 0 to 105. Patches with a score below 60 are marked as reconstruction blocks, and their embeddings XiP are discarded and do not enter the encoder.

Patch_formask is a patching function with the same structure as Patch. The difference is that the convolution kernel of Patch has learnable parameters θ, whereas the kernel parameters of Patch_formask are not learnable and are fixed to 1. The pixel-retention score has one output dimension, i.e. each patch corresponds to one score. The argument 1 of Patch_formask indicates that the kernel parameters are fixed to 1.

Furthermore, the pedestrian re-identification network uses a dynamic graph structure module and a graph-convolution feature-propagation module to help propagate features. The dynamic graph structure module builds different graph structures at multiple positions in the convolutional neural network, i.e. it converts the feature map into a K-nearest-neighbor graph at those positions. The module takes a feature map of height H and width W as input and outputs the adjacency matrix of that feature map; a sketch of the module is given below.
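The pseudo-code itself is not reproduced in the text. Below is a minimal sketch of the K-nearest-neighbor graph construction, assuming PyTorch and Euclidean distances between the H·W spatial positions ("nodes") of the feature map; the function name dynamic_graph_adjacency and the default K = 8 are illustrative assumptions rather than values stated in the patent.

```python
import torch

def dynamic_graph_adjacency(feature_map: torch.Tensor, k: int = 8) -> torch.Tensor:
    """feature_map: (C, H, W) -> adjacency matrix A of shape (H*W, H*W)."""
    c, h, w = feature_map.shape
    nodes = feature_map.reshape(c, h * w).t()          # one node per spatial position: (N, C)
    dist = torch.cdist(nodes, nodes)                   # pairwise Euclidean distances: (N, N)
    knn_idx = dist.topk(k + 1, largest=False).indices  # K nearest nodes per node (plus itself)
    adj = torch.zeros(h * w, h * w, device=feature_map.device)
    adj.scatter_(1, knn_idx, 1.0)                      # connect each node to its K nearest neighbors
    return adj
```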

Next, the correlation between every pair of nodes in the feature map is computed, giving the correlation matrix:

R = θ(F)·φ(F)^T

Here R is the correlation matrix, F is the feature map output by the convolutional layer, C is the number of dimensions of the feature map, and W and H are its width and height. X^T denotes the transpose of matrix X. θ(F) and φ(F) denote feeding the feature map F into two transfer functions with identical structure but different parameters; each transfer function consists of a 1×1 convolutional layer, a batch-normalization layer and a ReLU activation.

The correlation matrix R is then multiplied element-wise by the adjacency matrix A and normalized with a softmax function to obtain the similarity adjacency matrix:

Â = softmax(R ⊙ A)

Here Â is the similarity adjacency matrix, N is the number of nodes in the feature map, A is the adjacency matrix output by the dynamic graph structure module, and ⊙ denotes the Hadamard (element-wise) product.

Node features are then propagated on the graph structure:

F̂ = Â·F

Here F̂ is the feature after propagation and F is the node-feature matrix obtained by reshaping the feature map output by the convolutional layer. After propagation, the features can be reshaped back into a feature map.

All operations of the dynamic graph structure module and the graph-convolution feature-propagation module above are denoted OGA(F), where F is the feature map output by the convolutional layer.

Furthermore, a residual structure is introduced into the feature-propagation process. Stacking the OGA module several times also lets features propagate more fully. The OGA modules with residual structure are stacked as:

F_{l+1} = F_l + β·OGA(F_l)

where β is a learnable parameter.

Furthermore, during training, the loss function of the occluded pedestrian re-identification network is:

L = LID + LTriplet + ε·LC

Here LTriplet is the triplet loss, LID is the ID loss and LC is the center loss. ε is the balancing weight of the center loss and is set to 0.1 in this network.

The expression of LTriplet is:

LTriplet = (1/B) Σ_{i=1}^{B} max( ||f(xi^a) - f(xi^p)||2 - ||f(xi^a) - f(xi^n)||2 + α, 0 )

Here B is the number of samples in a training mini-batch, xi^a denotes the anchor (reference) sample, xi^p denotes a positive sample of the same class as the anchor but different from it, and xi^n denotes a negative sample of a different class. α is the set margin, which is 0.2 in the present invention, and f(x) is the feature of image x.

The expression of LID is a label-smoothed cross-entropy:

LID = Σ_{k=1}^{N} -q_k·log p_k,  with q_k = 1 - ((N-1)/N)·ε for k = y and q_k = ε/N otherwise

Here y denotes the true label of the training sample, N denotes the number of pedestrian identities in the dataset, p_k is the probability of identity k predicted by the network from the embedding f(xi), and ε is a constant set to 0.1 in the present invention.

The expression of LC is:

LC = (1/2) Σ_{i=1}^{B} ||f(xi) - c_{yi}||2^2

Here f(xi) is the feature of image xi and B is the number of samples in a training mini-batch. c_{yi} denotes the center of all features of class yi, i.e. the mean of the features of all images of class yi in the mini-batch.

Furthermore, after all networks have been trained to convergence, the test data is fed into the mask-guided masked-autoencoder image completion model to obtain de-occluded pedestrian images. The de-occluded images are then fed into the occluded pedestrian re-identification network based on dynamic graphs and graph convolution to obtain a feature for each image. For each query image, gallery images are sorted by feature distance, with closer gallery images ranked higher; the 10 closest gallery images are taken as the query result. Gallery images sharing the query's identity label are counted as correct matches, and the average precision of the query is computed. Averaging over all queries gives the mean average precision (mAP) of the invention on the dataset; the fraction of queries whose closest gallery image is correct gives the Rank-1 accuracy.

The present invention provides an occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels, with the following advantages:

(1) The self-supervised mask-guided image modeling network of the method is trained in a self-supervised manner, which avoids the time cost of large-scale image annotation while achieving performance on par with supervised methods.

(2) The method uses pixel completion, so the network can fully complete the occluded pedestrians in occluded data, allowing existing methods based on human key points to be applied to occluded images.

(3) The method uses an occluded pedestrian re-identification network that combines dynamic graph neural networks with graph convolution, making effective use of spatial information in the image; the dynamic graph neural network helps enlarge the effective receptive field of the convolutional neural network and improves pedestrian recognition performance across different scenes and occlusion conditions.

BRIEF DESCRIPTION OF THE DRAWINGS

To describe the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels provided by the present invention;

FIG. 2 is a schematic diagram of the completion model (comprising the instance segmentation network and the image modeling network);

FIG. 3 is a schematic diagram of the completion model's effect when completing occluded images;

FIG. 4 is a schematic diagram of the structure of the occluded pedestrian re-identification network based on dynamic graphs and graph convolution.

DETAILED DESCRIPTION

To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are only exemplary and are not intended to limit the scope of the present invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present invention.

Exemplary Method

As shown in FIG. 1, the present invention provides a flow chart of the occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels. The steps are as follows:

Step S110: All test-set images of the dataset are kept. The training data is split into images containing a relatively complete pedestrian body (denoted training data B) and pedestrian re-identification images with large occluders (denoted training data A). In addition, the pedestrians in the training data A images are annotated at the instance-segmentation level. All images are resized to a height of 210 pixels and a width of 98 pixels.

Step S120: The instance segmentation network is trained on training data A; the instance segmentation network is Mask2former and the loss function is the Dice loss. After training, the images of training data B are fed into the instance segmentation network to obtain a pedestrian mask for each image of training data B.

The self-supervised mask-guided image modeling network is trained to convergence with training data B and its mask images. Specifically, the black pixels of each pedestrian image's mask are overlaid on the corresponding pedestrian image, while white pixels are left untouched, giving training data B_withMask. The image modeling network is trained in a self-supervised manner on B_withMask (this self-supervised training does not use the image mask for masking; MiP is generated by a random function).
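As a sketch of how training data B_withMask can be constructed as described above, the following assumes numpy arrays with (H, W, 3) uint8 images and (H, W) masks in which white marks unoccluded pedestrian pixels; the function name and the 128 binarization threshold are assumptions.

```python
import numpy as np

def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Cover the image with the black pixels of its mask; white mask pixels leave it unchanged."""
    out = image.copy()
    black = mask < 128            # positions where the mask is black
    out[black] = 0                # overwrite those positions of the pedestrian image with black
    return out
```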

The image modeling network converts the image to be predicted Xi in R^(H×W×3) and its corresponding mask Mi (H and W are the image height and width, 3 is the number of RGB channels) into multiple patches, each 15 pixels high and 7 pixels wide. For an image Xi, the Patch operation convolves Xi with a kernel of height 15 and width 7, with a stride of 15 and 7 respectively, outputting C dimensions per patch. Converting the image into patches and flattening gives the patch embeddings:

XiP = Patch(Xi, θ)

MiP = Random(Mi)

Here C is the number of dimensions, which is 768 in this invention; the mask has 1 dimension. (Ph, Pw) = (15, 7) is the resolution of each patch, and the total number of patches is 196 in this invention. Patch is the image patching function, whose convolution kernel has learnable parameters θ. The Random function computes the number of patches from the input mask size and randomly generates a corresponding number of pixel-retention scores. MiP holds the pixel-retention score of every patch of the current image Xi, ranging from 0 to 105; patches with a score below 60 are marked as reconstruction blocks, and their embeddings are discarded and do not enter the encoder. MiP is generated randomly during training.
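A minimal sketch of the Patch and Random functions with the concrete numbers above (15x7 patches, C = 768, 196 patches for a 210x98 image, scores in 0..105) is given below, assuming a PyTorch implementation; the class and function names are illustrative.

```python
import torch
import torch.nn as nn

class Patch(nn.Module):
    """Patching function: a 15x7 convolution with stride 15x7 and 768 output channels."""
    def __init__(self, in_ch: int = 3, dim: int = 768, ph: int = 15, pw: int = 7):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=(ph, pw), stride=(ph, pw))  # learnable theta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 210, 98) -> flattened patch embeddings (B, 196, 768)
        return self.proj(x).flatten(2).transpose(1, 2)

def random_retention_scores(num_patches: int = 196) -> torch.Tensor:
    # during training, one pixel-retention score per patch, drawn uniformly from 0..105
    return torch.randint(0, 106, (num_patches,))
```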

The retained patch embeddings are added to the patch position encodings and fed into the pixel-reconstruction encoder. The position encoding uses the 2D formula:

PE(posX, 2i) = sin(posX / 10000^(2i/dmodel)),  PE(posY, 2i+1) = cos(posY / 10000^(2i/dmodel))

Here posX and posY are the horizontal and vertical coordinates of the patch in the original image, dmodel is the dimensionality of the position encoding (equal to the patch-embedding dimensionality C in this invention), and i takes integer values from 0 to 0.5C-1. After every term is computed, the values are arranged as PE(posX,0), PE(posY,1), PE(posX,2), PE(posY,3), ..., PE(posX,766), PE(posY,767) to form the 2D position encoding.
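The following is a sketch of this 2D sinusoidal position encoding, assuming the standard sin/cos form in which even channels encode the patch's x coordinate and odd channels its y coordinate; the exact frequency layout is an assumption consistent with the formula above.

```python
import torch

def pos_encoding_2d(pos_x: int, pos_y: int, dim: int = 768) -> torch.Tensor:
    """Return the C-dimensional position encoding of a patch at (pos_x, pos_y)."""
    pe = torch.zeros(dim)
    two_i = torch.arange(0, dim, 2, dtype=torch.float32)   # 2i = 0, 2, ..., dim - 2
    div = torch.pow(10000.0, two_i / dim)                  # 10000^(2i / d_model)
    pe[0::2] = torch.sin(pos_x / div)                      # even channels use posX
    pe[1::2] = torch.cos(pos_y / div)                      # odd channels use posY
    return pe
```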

After the reconstruction encoder, the original patch embeddings are converted into an intermediate tensor. A learnable tensor is inserted at each discarded-patch position of the intermediate tensor, producing the decoder input tensor. The learnable tensor has the same shape as the patch embedding it replaces, i.e. a vector of dimensionality C = 768. The decoder input tensor is fed into the decoder to obtain the pixel-reconstructed embeddings of all patches, which are unpacked to obtain the reconstructed image.
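A sketch of inserting the learnable tensor at the discarded-patch positions of the intermediate tensor follows, assuming one shared learnable C-dimensional token (as in masked-autoencoder implementations) and that the original positions of the retained patches are known; names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class MaskTokenInserter(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable C-dimensional tensor

    def forward(self, encoded: torch.Tensor, keep_idx: torch.Tensor, num_patches: int) -> torch.Tensor:
        # encoded: (B, N_keep, C) encoder output of retained patches
        # keep_idx: (B, N_keep) original positions of those patches
        b, _, c = encoded.shape
        full = self.mask_token.expand(b, num_patches, c).clone()              # mask token everywhere
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, c), encoded)   # restore retained patches
        return full                                                           # decoder input: (B, N, C)
```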

The self-supervised mask-guided image modeling network computes the loss only over patches whose pixel-retention score is below 60; the loss is the mean squared error.
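A sketch of this reconstruction loss is shown below, assuming the prediction and target are given as per-patch pixel vectors and that the loss is averaged only over patches whose retention score is below 60; the tensor shapes are assumptions.

```python
import torch

def masked_mse_loss(pred: torch.Tensor, target: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # pred, target: (B, N, Ph*Pw*3) patch pixels; scores: (B, N) pixel-retention scores
    per_patch = ((pred - target) ** 2).mean(dim=-1)    # mean squared error per patch
    mask = (scores < 60).float()                       # only reconstruction blocks contribute
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```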

Furthermore, during testing, the test data is fed into the converged instance segmentation network to obtain the test-image mask MiTest. The test data and MiTest are fed together into the image modeling network to obtain de-occluded pedestrian images.

During testing, only the way the pixel-retention scores MiP are obtained differs from training. They are computed as:

MiP = Patch_formask(MiTest, 1)

MiP holds the pixel-retention score of every patch of the current image; each score ranges from 0 to 105. Patches with a score below 60 are marked as reconstruction blocks, and their embeddings XiP are discarded and do not enter the encoder.

Patch_formask is a patching function with the same structure as Patch; the difference is that the convolution kernel of Patch has learnable parameters θ, whereas the kernel parameters of Patch_formask are not learnable and are fixed to 1. The pixel-retention score has one output dimension, i.e. each patch corresponds to one score.

The argument 1 of Patch_formask indicates that the kernel parameters are not learnable and are fixed to 1.
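A sketch of Patch_formask follows: the same patching convolution as Patch but with a single output channel and a fixed all-ones kernel, so each output value counts the retained (white) pixels of a 15x7 patch and therefore lies in 0..105. The assumption that the mask is binarized to {0, 1} before the convolution is ours.

```python
import torch
import torch.nn.functional as F

def patch_formask(mask: torch.Tensor, ph: int = 15, pw: int = 7) -> torch.Tensor:
    # mask: (B, 1, H, W) with 1 = retained pedestrian pixel, 0 = occluded or background
    ones_kernel = torch.ones(1, 1, ph, pw, device=mask.device)      # fixed, non-learnable weights of 1
    scores = F.conv2d(mask.float(), ones_kernel, stride=(ph, pw))   # (B, 1, H/ph, W/pw)
    return scores.flatten(1)                                        # one retention score per patch
```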

All other steps are exactly the same as in training: the test data is fed into the instance segmentation network to obtain the test-image masks, and the test data and the masks are fed together into the self-supervised mask-guided image modeling network to obtain de-occluded pedestrian images. These de-occluded pedestrian images serve as the test data of the occluded pedestrian re-identification network based on dynamic graphs and graph convolution.

Step S130: The occluded pedestrian re-identification network is trained with training set B. The re-identification network consists of a ResNet-50 network, the dynamic graph structure module and the graph-convolution feature-propagation module. ResNet-50 extracts feature maps from pedestrian re-identification images, and the dynamic-graph feature-propagation modules propagate features over the normalized feature maps to obtain more robust and discriminative features. Specifically, several layers of the dynamic graph structure module and the dynamic graph-convolution feature-propagation module are inserted after stages 1, 2, 3 and 4 of ResNet-50 (the modules only change the content of the feature map, not its shape).
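A sketch of this backbone layout is given below, assuming torchvision's resnet50 with layer1..layer4 as the four stages and a shape-preserving OGA block (such as the one sketched later in this description) passed in as oga_block_cls; the number of stacked OGA blocks per stage is an assumption.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ReIDBackbone(nn.Module):
    def __init__(self, oga_block_cls, num_oga: int = 2):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        stage_dims = [256, 512, 1024, 2048]            # output channels of ResNet-50 stages 1-4
        self.oga = nn.ModuleList(
            nn.Sequential(*[oga_block_cls(c) for _ in range(num_oga)]) for c in stage_dims
        )

    def forward(self, x):
        x = self.stem(x)
        for stage, oga in zip(self.stages, self.oga):
            x = oga(stage(x))                          # OGA changes only the content of the feature map
        return x
```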

A feature map of height H and width W is fed into the dynamic graph structure module to obtain the adjacency matrix A corresponding to that feature map.

The dynamic graph structure module operates as in the sketch given after its first description above.

Next, the correlation between every pair of nodes in the feature map is computed, giving the correlation matrix:

R = θ(F)·φ(F)^T

Here R is the correlation matrix, F is the feature map output by the convolutional layer, C is the number of dimensions of the feature map, and W and H are its width and height. X^T denotes the transpose of matrix X. θ(F) and φ(F) denote feeding the feature map F into two transfer functions with identical structure but different parameters; each transfer function consists of a 1×1 convolutional layer, a batch-normalization layer and a ReLU activation.

The correlation matrix R is then multiplied element-wise by the adjacency matrix A and normalized with a softmax function to obtain the similarity adjacency matrix:

Â = softmax(R ⊙ A)

Here Â is the similarity adjacency matrix, N is the number of nodes in the feature map, A is the adjacency matrix output by the dynamic graph structure module, and ⊙ denotes the Hadamard (element-wise) product.

Node features are then propagated on the graph structure:

F̂ = Â·F

Here F̂ is the feature after propagation and F is the node-feature matrix obtained by reshaping the feature map output by the convolutional layer. After propagation, the features can be reshaped back into a feature map.

All operations of the dynamic graph structure module and the graph-convolution feature-propagation module above are denoted OGA(F), where F is the feature map output by the convolutional layer.

Furthermore, a residual structure is introduced into the feature-propagation process. Stacking the OGA module several times also lets features propagate more fully. The OGA modules with residual structure are stacked as:

F_{l+1} = F_l + β·OGA(F_l)

where β is a learnable parameter.
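The following is a sketch of one OGA block combining the dynamic graph structure module, the graph-convolution feature propagation and the residual scaling β described above. It assumes the dynamic_graph_adjacency helper sketched earlier is in scope, and reading the propagation as softmax(R ⊙ A) applied to the node features is our interpretation of the formulas, not a verbatim reproduction of the patent's implementation.

```python
import torch
import torch.nn as nn

class OGABlock(nn.Module):
    def __init__(self, channels: int, k: int = 8):
        super().__init__()
        def transfer():   # 1x1 conv + batch norm + ReLU, as stated for theta and phi
            return nn.Sequential(nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU())
        self.theta, self.phi = transfer(), transfer()
        self.beta = nn.Parameter(torch.zeros(1))       # learnable residual weight
        self.k = k

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        nodes = f.flatten(2).transpose(1, 2)                                  # (B, N, C), N = H*W
        t = self.theta(f).flatten(2).transpose(1, 2)                          # theta(F): (B, N, C)
        p = self.phi(f).flatten(2)                                            # phi(F)^T: (B, C, N)
        r = torch.bmm(t, p)                                                   # correlation matrix R
        adj = torch.stack([dynamic_graph_adjacency(x, self.k) for x in f])    # KNN adjacency A
        a_hat = torch.softmax(r * adj, dim=-1)                                # similarity adjacency
        propagated = torch.bmm(a_hat, nodes)                                  # graph feature propagation
        out = nodes + self.beta * propagated                                  # residual connection
        return out.transpose(1, 2).reshape(b, c, h, w)                        # back to a feature map
```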

Furthermore, during training, the loss function of the occluded pedestrian re-identification network is:

L = LID + LTriplet + ε·LC

Here LTriplet is the triplet loss, LID is the ID loss and LC is the center loss. ε is the balancing weight of the center loss and is set to 0.1 in this network.

The expression of LTriplet is:

LTriplet = (1/B) Σ_{i=1}^{B} max( ||f(xi^a) - f(xi^p)||2 - ||f(xi^a) - f(xi^n)||2 + α, 0 )

Here B is the number of samples in a training mini-batch, xi^a denotes the anchor (reference) sample, xi^p denotes a positive sample of the same class as the anchor but different from it, and xi^n denotes a negative sample of a different class. α is the set margin, which is 0.2 in the present invention, and f(x) is the feature of image x.

The expression of LID is a label-smoothed cross-entropy:

LID = Σ_{k=1}^{N} -q_k·log p_k,  with q_k = 1 - ((N-1)/N)·ε for k = y and q_k = ε/N otherwise

Here y denotes the true label of the training sample, N denotes the number of pedestrian identities in the dataset, p_k is the probability of identity k predicted by the network from the embedding f(xi), and ε is a constant set to 0.1 in the present invention.

The expression of LC is:

LC = (1/2) Σ_{i=1}^{B} ||f(xi) - c_{yi}||2^2

Here f(xi) is the feature of image xi and B is the number of samples in a training mini-batch. c_{yi} denotes the center of all features of class yi, i.e. the mean of the features of all images of class yi in the mini-batch.
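A sketch of the total loss L = LID + LTriplet + 0.1·LC follows, assuming batch-hard triplet mining with Euclidean distances and margin 0.2, a cross-entropy ID loss with label smoothing ε = 0.1, and per-batch class means as the centers of the center loss; these concrete choices are our reading of the formulas above rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def total_loss(logits, feats, labels, margin: float = 0.2, eps: float = 0.1, center_w: float = 0.1):
    # logits: (B, num_ids) classifier outputs, feats: (B, D) embeddings, labels: (B,) identity labels
    id_loss = nn.CrossEntropyLoss(label_smoothing=eps)(logits, labels)

    dist = torch.cdist(feats, feats)                                        # pairwise feature distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values                   # farthest positive per anchor
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values    # closest negative per anchor
    triplet_loss = F.relu(hardest_pos - hardest_neg + margin).mean()

    centers = torch.stack([feats[labels == y].mean(dim=0) for y in labels]) # per-batch class centers
    center_loss = 0.5 * ((feats - centers) ** 2).sum(dim=1).mean()

    return id_loss + triplet_loss + center_w * center_loss
```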

The network is trained with a gradient-descent algorithm using the default Adam configuration; training ends when the loss reaches its minimum.

Step S140: Once the network is trained to convergence, the test data can be used to evaluate the occluded pedestrian re-identification network based on dynamic graphs and graph convolution. The de-occluded pedestrian images obtained in step S120 are fed into the re-identification network for feature extraction, giving one feature per image. For each query image, gallery images are sorted by feature distance, with closer gallery images ranked first; the 10 closest gallery images are taken as the query result. Gallery images with the same identity label are counted as correct matches of the query, and the average precision of the query is computed. Averaging over all query images gives the mean average precision (mAP) of the invention on the dataset; the fraction of query images whose closest gallery image is correct gives the Rank-1 accuracy.
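A sketch of this evaluation protocol is given below, assuming Euclidean feature distances and ignoring the camera-ID filtering rules that standard Occluded-DukeMTMC evaluation applies; the function name and tensor layout are assumptions.

```python
import torch

def evaluate(query_feats, query_ids, gallery_feats, gallery_ids):
    """query_feats: (Q, D), gallery_feats: (G, D), *_ids: integer identity labels."""
    dist = torch.cdist(query_feats, gallery_feats)             # (Q, G) feature distances
    order = dist.argsort(dim=1)                                # gallery indices, closest first
    matches = gallery_ids[order] == query_ids.unsqueeze(1)     # (Q, G) correct-match flags

    rank1 = matches[:, 0].float().mean().item()                # first-hit (Rank-1) rate

    aps = []
    for row in matches:                                        # average precision per query
        hits = row.nonzero().flatten()
        if hits.numel() == 0:
            continue
        precision = torch.arange(1, hits.numel() + 1, dtype=torch.float32) / (hits + 1).float()
        aps.append(precision.mean().item())
    mean_ap = sum(aps) / max(len(aps), 1)
    return rank1, mean_ap
```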

Step S150: From the obtained results, the pedestrian re-identification accuracy on the occluded person re-identification dataset Occluded-DukeMTMC is computed.

In this embodiment, the dataset is first split into a part with relatively complete human bodies and a part with heavily occluded pixels (training data B and training data A, respectively). Training data A and B are then used to train the image completion model to convergence, and training data B is used to train the occluded pedestrian re-identification network. During testing, the test data is first fed into the image completion model: it is passed through the instance segmentation network to obtain the mask for each test image, and the test data and its masks are fed together into the image modeling network for occluded-pixel completion. Finally, the de-occluded pedestrian images are fed into the re-identification network for feature extraction and prediction, giving the network's classification results.

To further illustrate: if an occluded person re-identification dataset is classified according to this embodiment, a classification result with higher accuracy than most methods is obtained.

Results of the Embodiment

This embodiment uses the publicly available occluded person re-identification dataset Occluded-DukeMTMC. The dataset is described as follows:

In this dataset, all query images are occluded by various objects (e.g. trees, cars, other people). The training, query and gallery sets contain 14%, 15% and 10% occluded images respectively. The training set of Occluded-DukeMTMC contains 15,618 images covering 702 identities. The test set contains 1,110 identities, with 17,661 gallery images and 2,210 query images. Occluded-DukeMTMC is the largest and most difficult dataset for the occluded person re-identification task.

To verify the superiority of this embodiment (Ours), it is compared with several existing occluded person re-identification methods, including BoT, MHSA, OAMN and HG, in terms of mean average precision (mAP) and Rank-1 accuracy on the above public dataset. The comparison is shown in Table 1.

Table 1. Accuracy (%) on the Occluded-DukeMTMC dataset

The comparison in the table above clearly shows that Ours achieves the best performance and significantly improves the accuracy of occluded person re-identification. The quantitative results demonstrate the superiority of Ours, since it better completes the human key-point information and viewpoint-difference information in occluded images. By combining a convolutional neural network with a dynamic graph neural network, Ours effectively enlarges the receptive field of the network and thereby improves robustness on occluded re-identification data. Extensive experiments show that the method outperforms existing methods.

This embodiment proposes an occluded person re-identification method based on masked self-supervised reconstruction of occluded pixels, used to perform pedestrian re-identification on occluded images.

An image completion model consisting of an instance segmentation network and a self-supervised mask-guided image modeling network is trained to reconstruct the occluded pedestrian pixels of an image, giving de-occluded pedestrian images. An occluded person re-identification network based on dynamic graphs and graph convolution then performs re-identification on the de-occluded pedestrian images. The dynamic graph structure used by this network effectively enlarges the receptive field of the convolutional neural network and increases the discriminability and robustness of the learned feature maps. Experimental results on the public dataset Occluded-DukeMTMC show that this embodiment achieves higher accuracy than other methods.

It should be understood that the above specific embodiments of the present invention are only intended to illustrate or explain the principles of the present invention and do not limit it. Any modification, equivalent substitution, improvement, etc. made without departing from the spirit and scope of the present invention shall fall within the protection scope of the present invention. Furthermore, the appended claims of the present invention are intended to cover all changes and modifications that fall within the scope and boundaries of the appended claims, or equivalents of such scope and boundaries.

Claims (4)

1. An occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels, characterized by comprising the following steps:
collecting training data D, training data E and test data from a surveillance camera device, wherein training data D comprises occluded pedestrian re-identification images together with instance-segmentation-level labels and pedestrian IDs, training data E comprises pedestrian re-identification images of relatively complete human bodies, and the test data comprises only the original occluded pedestrian re-identification images and pedestrian IDs; constructing a mask-guided masked-autoencoder image completion model, wherein the image completion model consists of an instance segmentation network and a self-supervised mask-guided image modeling network, the instance segmentation network segments the pedestrian instances of an input image to obtain a mask for each pedestrian, a random function randomly generates pixel-retention scores for the different image patches of a pedestrian mask image, the self-supervised mask-guided image modeling network reconstructs the patches of the pedestrian image whose pixel-retention score is below 60 to obtain a de-occluded pedestrian image, and training data D and training data E are used to train the image completion model to convergence;
predicting the test data with the completion model: inputting the test data into the instance segmentation network to obtain the pedestrian masks corresponding to the test data, computing the pixel-retention scores of the different patches of the pedestrian mask images with a patching function, and reconstructing the patches of the pedestrian images whose pixel-retention score is below 60 with the self-supervised mask-guided image modeling network to obtain de-occluded pedestrian images;
constructing an occluded pedestrian re-identification network based on dynamic graphs and graph convolution, wherein the re-identification network consists of a ResNet-50 network, a dynamic graph structure module and a graph-convolution feature-propagation module, the ResNet-50 network extracts a feature map from a pedestrian image, the dynamic graph structure module constructs the topological graph corresponding to the feature map, the graph-convolution feature-propagation module propagates features over the topological graph, and training data E is used to train the occluded pedestrian re-identification network to convergence;
in the prediction process of the occluded pedestrian re-identification network, inputting the de-occluded pedestrian images into the converged re-identification model to obtain the feature of each pedestrian image, sorting the gallery images of each query image by feature distance from small to large, taking the 10 closest gallery images as the query result, taking gallery images with the same identity label as correct matches of the query image, and computing the average precision and Rank-1 accuracy of the query;
the training process of the occluded pedestrian re-identification network based on dynamic graphs and graph convolution is as follows:
the re-identification network uses the dynamic graph structure module and the graph-convolution feature-propagation module to assist feature propagation; the dynamic graph structure module builds different graph structures at multiple positions of the convolutional neural network, i.e. it converts the feature map into a K-nearest-neighbor graph at those positions; the module takes a feature map of height H and width W as input and outputs the adjacency matrix A corresponding to the feature map; the pseudo-code of the dynamic graph structure module is as follows:
then, calculating the correlation between every pair of nodes in the feature map to obtain the correlation matrix, with the formula:
R = θ(F)·φ(F)^T
wherein R is the correlation matrix, F is the feature map output by the convolutional layer, C is the number of dimensions of the feature map, W and H are its width and height, X^T denotes the transpose of matrix X, and θ(F) and φ(F) denote feeding the feature map F into two transfer functions of identical structure but different parameters, each consisting of a 1×1 convolutional layer, a batch-normalization layer and a ReLU activation function;
multiplying the correlation matrix R by the adjacency matrix A and normalizing with a softmax function to obtain the similarity adjacency matrix, the formula being:

Â = softmax(R ⊙ A)

wherein Â is the similarity adjacency matrix, G is the number of nodes in the feature map, A is the adjacency matrix output by the dynamic graph structure module, and ⊙ denotes the Hadamard product of matrices; node features are then propagated on the graph structure according to:

F̂ = Â·F

wherein F̂ is the feature after feature propagation and F is the feature obtained by matrix transformation of the feature map output by the convolutional layer; through matrix transformation, the propagated feature can be converted back into a feature map;
all processes of the above dynamic graph structure module and graph-convolution feature-propagation module are denoted OGA(F), where F is the feature map output by the convolutional layer; the residual structure is introduced into the feature-propagation process, and stacking the OGA module several times lets the features propagate more fully, the OGA modules with residual structure being stacked as:

F_{l+1} = F_l + β·OGA(F_l)
wherein β represents a learnable parameter;
in the training process, the loss function of the pedestrian re-identification network is expressed as:
L = LID + LTriplet + ε·LC
where LTriplet is the triplet loss, LID is the ID loss, LC is the center loss, and ε is the balancing weight of the center loss, which is set to 0.1 in the present network;
the expression of LTriplet is as follows:

LTriplet = (1/B) Σ_{i=1}^{B} max( ||f(xi^a) - f(xi^p)||2 - ||f(xi^a) - f(xi^n)||2 + α, 0 )

where B is the number of samples in a training mini-batch, xi^a denotes the reference (anchor) sample, xi^p denotes a positive sample of the same class as the reference sample but different from it, xi^n denotes a negative sample of a different class from the reference sample, α denotes the set training margin, which is set to 0.2, and f(x) is the feature of picture x;
the expression of LID is as follows (a cross-entropy with label smoothing):

LID = Σ_{k=1}^{Z} -q_k·log p_k,  with q_k = 1 - ((Z-1)/Z)·ε for k = y and q_k = ε/Z otherwise

in the above formula, y represents the true label value of the training sample, Z represents the number of pedestrian identities in the dataset, f(xi) represents the embedding predicted by the network for picture xi, p_k is the probability of identity k obtained from f(xi), and ε is a constant set to 0.1;
the expression of LC is as follows:

LC = (1/2) Σ_{i=1}^{B} ||f(xi) - c_{yi}||2^2

where f(xi) is the feature of picture xi, B is the number of samples in a training mini-batch, and c_{yi} represents the center of all features of class yi, i.e. the mean of the features of all pictures of class yi in the mini-batch.
2. The occluded pedestrian re-identification method based on masked self-supervised reconstruction of occluded pixels according to claim 1, wherein the completion model performs de-occlusion on the occluded pedestrian image, and the training process of the completion model comprises the following steps:
(1) training the existing instance segmentation network Mask2former to convergence with training data D to obtain an instance segmentation network capable of outputting the human-body mask of a pedestrian image, wherein the converged instance segmentation network predicts the pedestrian mask Mi corresponding to an image Xi of identity i, pixels of Mi at the positions of non-occluded pedestrian pixels of the original image Xi are white, and other pixels are black;
(2) reconstructing the occluded pedestrian pixels of an occluded pedestrian image with the self-supervised mask-guided image modeling network, wherein the occluded pedestrian image Xi in R^(H×W×3) and its corresponding mask Mi (H and W are the height and width of the picture, 3 is the number of RGB channels) are converted into patch embeddings by a patching function, each patch has height Ph and width Pw, for the picture Xi and the mask Mi the patching function convolves each picture Xi with a kernel of size Ph×Pw and stride Ph×Pw, the output dimensionality of each patch is C, and the patch embeddings obtained by converting the picture with the patching function and flattening are as follows:
X_iP = Patch(X_i, θ)
M_iP = Random(M_i)
where Patch is the image blocking function, X_iP ∈ R^(N×C), C is the embedding dimension, (P_h, P_w) represents the resolution of each image block, N = HW/(P_h·P_w) is the total number of image blocks, and θ is a learnable parameter; the Random function calculates the number of image blocks according to the size of the input mask image and randomly generates a corresponding number of pixel retention scores; M_iP represents the pixel retention scores of all image blocks in the current picture X_i, each score takes a value from 0 to 105, the image blocks with a pixel retention score less than 60 are marked as reconstruction blocks, the entries of X_iP corresponding to the marked image blocks are discarded and do not enter the encoder part, and M_iP is generated by the Random function during the training phase;
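A sketch of the Patch and Random functions under the stated construction: Patch is a convolution with kernel size and stride (P_h, P_w) and C output channels followed by flattening, and Random draws one retention score per image block; the exact distribution of the random scores over 0–105 is not specified, so uniform sampling is assumed.

```python
import torch
import torch.nn as nn

class Patch(nn.Module):
    """Image blocking function: convolution with kernel size and stride (P_h, P_w) and C output
    channels, then flattening, giving X_iP of shape (N, C) with N = H*W / (P_h*P_w)."""
    def __init__(self, p_h: int, p_w: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=(p_h, p_w), stride=(p_h, p_w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, 3, H, W) -> (1, C, H/P_h, W/P_w) -> (N, C)
        return self.proj(x).flatten(2).squeeze(0).transpose(0, 1)

def random_retention_scores(mask: torch.Tensor, p_h: int, p_w: int) -> torch.Tensor:
    """Training-time Random(M_i): one score in [0, 105] per image block, sampled uniformly here;
    blocks scoring below 60 are later marked as reconstruction blocks and dropped before the encoder."""
    h, w = mask.shape[-2:]
    n_blocks = (h // p_h) * (w // p_w)
    return torch.randint(0, 106, (n_blocks,)).float()
```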
(3) The embeddings of the non-discarded image blocks are added to the image block position codes and then input into the pixel reconstruction encoder, wherein the position codes use a 2D position coding formula as follows:
pos_X and pos_Y are the abscissa and ordinate of the image block in the original image, d_model is the dimension of the position code and equals the image block embedding dimension C, and i takes integer values from 0 to 0.5C−1; after each position code is calculated, the obtained codes are arranged as follows to obtain the 2D position code: PE(pos_X, 0), PE(pos_Y, 1), PE(pos_X, 2), PE(pos_Y, 3), ..., PE(pos_X, C−2), PE(pos_Y, C−1);
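The sinusoidal formula itself is not reproduced above, so the sketch below assumes the standard sine/cosine form and only follows what the claim fixes, namely the interleaving PE(pos_X,0), PE(pos_Y,1), PE(pos_X,2), PE(pos_Y,3), ... over the C channels.

```python
import math
import torch

def pe_2d(pos_x: int, pos_y: int, dim: int) -> torch.Tensor:
    """2D position code of one image block at grid coordinates (pos_x, pos_y); dim = C must be even.
    Assumes the usual sinusoidal formula, with channels interleaved as
    PE(pos_X,0), PE(pos_Y,1), PE(pos_X,2), PE(pos_Y,3), ..."""
    pe = torch.zeros(dim)
    for i in range(dim // 2):
        div = 10000 ** (2 * i / dim)
        pe[2 * i] = math.sin(pos_x / div)      # even channels encode the abscissa
        pe[2 * i + 1] = math.cos(pos_y / div)  # odd channels encode the ordinate
    return pe
```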
The embeddings of the original (non-discarded) image blocks are input into the reconstruction encoder to obtain an intermediate tensor; a learnable tensor is inserted at each discarded image block position in the intermediate tensor to obtain the intermediate tensor to be learned, wherein the shape of the learnable tensor is the same as the shape of the image block embedding it replaces, i.e., R^(1×C), where C represents the dimension of the image block embedding; the intermediate tensor to be learned is input into the decoder to obtain the reconstructed image block embeddings, and the reconstructed image block embeddings are un-patched to obtain the reconstructed image;
The self-supervised mask-guided image modeling network computes the loss only for image blocks with pixel retention scores less than 60, using the mean square error as the loss function.
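A sketch of the claimed flow with the encoder and decoder abstracted as generic sequence modules: the learnable R^(1×C) tensor is inserted at every discarded block position, and the mean square error is computed only over the blocks whose retention score is below 60.

```python
import torch
import torch.nn as nn

def reconstruct_and_loss(tokens: torch.Tensor,        # (N, C) block embeddings + position codes
                         scores: torch.Tensor,        # (N,) pixel retention scores M_iP
                         target_pixels: torch.Tensor, # (N, P_h*P_w*3) ground-truth pixels per block
                         encoder: nn.Module,          # sequence encoder: (K, C) -> (K, C)
                         decoder: nn.Module,          # sequence decoder: (N, C) -> (N, C)
                         mask_token: torch.Tensor,    # learnable tensor of shape (1, C)
                         head: nn.Module) -> torch.Tensor:  # maps C -> P_h*P_w*3 pixel values
    keep = scores >= 60                                # blocks below 60 are reconstruction blocks
    encoded = encoder(tokens[keep])                    # intermediate tensor from kept blocks only
    full = mask_token.expand(tokens.size(0), -1).clone()
    full[keep] = encoded                               # learnable tensor stays at discarded positions
    pred = head(decoder(full))                         # reconstructed block embeddings -> pixels
    masked = ~keep
    return ((pred[masked] - target_pixels[masked]) ** 2).mean()  # MSE only on reconstruction blocks
```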
3. The method for re-identifying blocked pedestrians based on mask self-supervision blocked pixel reconstruction according to claim 1, wherein, in the prediction process of the complement model, the pixel retention score M_iP is obtained in a different manner from the training process;
The test data are sent into the converged instance segmentation network to obtain a test picture mask M_iTest; the test data and the test picture mask M_iTest are then input into the image modeling network together to obtain de-occluded pedestrian images, wherein the flow of the image modeling network in the prediction process is as follows:
The pedestrian picture and the test picture mask M_iTest are input into the image blocking functions together to obtain the flattened image block embeddings and the pixel retention scores M_iP; the image blocks whose pixel retention score M_iP is smaller than 60 are discarded; the embeddings of the non-discarded image blocks are added to the image block position codes and input into the pixel reconstruction encoder to obtain an intermediate tensor; a learnable tensor is inserted at each discarded image block position in the intermediate tensor to obtain the intermediate tensor to be learned; the intermediate tensor to be learned is input into the decoder to obtain the reconstructed image block embeddings; and the reconstructed image is obtained after un-patching the reconstructed image block embeddings;
The pixel retention score M_iP in the prediction process is obtained by the following formula:
M_iP = Patch_formask(M_iTest, 1)
M_iP represents the pixel retention score of each image block in the current picture X_i and takes a value from 0 to 105; the image blocks with a pixel retention score less than 60 are marked as reconstruction blocks, and the entries of X_iP corresponding to the marked image blocks are discarded and do not enter the encoder part; Patch_formask is an image blocking function with the same structure as the Patch function, the difference being that the convolution kernel of the Patch function has learnable parameters θ, whereas none of the parameters in the convolution kernel of the Patch_formask function is learnable and all are fixed to 1, corresponding to the 1 in Patch_formask(M_iTest, 1); the output dimension of the pixel retention score is 1, i.e., one pixel retention score is produced for each image block.
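A sketch of Patch_formask under the description above: the same blocking convolution as Patch but with a single output channel and every kernel weight fixed to 1, so each block's score is the sum of the mask values inside it; how that sum is normalized onto the 0–105 range in the claim is not stated and is left out here.

```python
import torch
import torch.nn as nn

def patch_formask(mask: torch.Tensor, p_h: int, p_w: int) -> torch.Tensor:
    """Patch_formask: same blocking structure as Patch but with one output channel and every
    convolution weight fixed to 1 (not learnable), so each image block receives a single retention
    score equal to the sum of the mask values inside it. mask: (1, 1, H, W); returns (N,) scores."""
    conv = nn.Conv2d(1, 1, kernel_size=(p_h, p_w), stride=(p_h, p_w), bias=False)
    with torch.no_grad():
        conv.weight.fill_(1.0)            # fixed weights, all equal to 1
    conv.weight.requires_grad_(False)     # not learnable
    return conv(mask).flatten()
```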
4. The method for re-identifying blocked pedestrians based on mask self-supervision blocked pixel reconstruction according to claim 1, wherein the prediction process of the blocked pedestrian re-identification network based on dynamic graph and graph convolution is as follows:
After all networks are trained to convergence, the test data are sent into the mask self-encoder fine-tuned image complement model based on mask guidance to obtain de-occluded pedestrian images; the de-occluded pedestrian images are sent into the blocked pedestrian re-recognition network based on dynamic graph and graph convolution to obtain the feature of each picture; for each query picture, the library pictures are sorted by feature distance from small to large, and the 10 nearest library pictures are taken as the query result of that query picture;
The library pictures with the same identity label as the query picture are regarded as correct matches; the average precision of each query picture is computed and averaged over all query pictures to obtain the mean average precision on the data set, and the proportion of query pictures whose nearest library picture is a correct match is computed to obtain the first hit rate.
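A simple sketch of the ranking and evaluation described above, using Euclidean feature distance: the 10 nearest library pictures form each query result, the first hit rate is the fraction of queries whose nearest library picture shares the query identity, and the average precision is averaged over queries; camera-based filtering common in re-identification benchmarks is omitted.

```python
import numpy as np

def evaluate(query_feats: np.ndarray, query_ids: np.ndarray,
             gallery_feats: np.ndarray, gallery_ids: np.ndarray, top_k: int = 10):
    """query_feats: (Q, D), gallery_feats: (G, D); *_ids: integer identity labels.
    Returns the per-query top-k library indices, the first hit rate, and the mean average precision."""
    dists = np.linalg.norm(query_feats[:, None, :] - gallery_feats[None, :, :], axis=2)  # (Q, G)
    results, rank1_hits, aps = [], [], []
    for q in range(len(query_ids)):
        order = np.argsort(dists[q])                    # library pictures sorted by distance
        results.append(order[:top_k])                   # 10 nearest library pictures = query result
        matches = gallery_ids[order] == query_ids[q]
        rank1_hits.append(float(matches[0]))            # nearest picture has the correct identity?
        if matches.any():
            hit_ranks = np.flatnonzero(matches) + 1     # 1-indexed ranks of correct matches
            precision_at_hits = np.cumsum(matches)[matches] / hit_ranks
            aps.append(precision_at_hits.mean())        # average precision of this query
    return results, float(np.mean(rank1_hits)), float(np.mean(aps))
```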