CN113792669A - Pedestrian re-identification baseline method based on hierarchical self-attention network - Google Patents
- Publication number
- CN113792669A (application number CN202111087471.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- pedestrian
- swin
- loss
- block
- Prior art date
- 2021-09-16
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F17/10—Complex mathematical operations
- G06F18/253—Fusion techniques of extracted features
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- Y02T10/40—Engine management systems
Abstract
The invention provides a pedestrian re-identification baseline method based on a hierarchical self-attention network, belonging to the field of computer vision. The method creatively introduces the Swin Transformer as a backbone network into the field of pedestrian re-identification, uses a weighted sum of ID loss and Circle loss as the loss function, and improves the feature-extraction capability while keeping the structure simple through effective data preprocessing and reasonable parameter tuning. Compared with the traditional ResNet-based baseline method, the method significantly improves the pedestrian re-identification effect.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian re-identification baseline method based on a hierarchical self-attention network.
Background
Pedestrian re-identification uses computer vision techniques to identify a specific pedestrian across cameras: given a surveillance image of a pedestrian, it retrieves images of that pedestrian captured by other devices. Identifying specific pedestrians is of great significance for violation judgment, criminal investigation, danger early warning, and the like.
A good baseline method should achieve strong results while keeping the parameter count low. Existing pedestrian re-identification baseline methods are based on ResNet and are constrained by the limitations of convolutional neural networks in feature extraction, so ResNet-based baselines cannot achieve ideal results.
With the progress of research, Transformers are gradually being applied to the field of computer vision. However, existing Transformer-based pedestrian re-identification methods suffer from problems such as excessive computation and a single-scale feature receptive field.
Disclosure of Invention
The invention provides a pedestrian re-identification baseline method based on a hierarchical self-attention network, which aims to solve the problems described in the background art and to achieve good results with a simple structure.
The technical scheme of the invention is as follows:
a pedestrian re-identification baseline method based on a hierarchical self-attention network comprises the following specific steps:
step one, preprocessing data;
Suppose there are N different pedestrians in total, and each pedestrian i has M_i images, where M_i > 1; M_i denotes the number of images in the class of the i-th pedestrian, and i denotes the ID number of each pedestrian. For the i-th pedestrian, M_i - 1 images are used as the training set and 1 image as the validation set, with i used as the label indicating that the image corresponds to the i-th pedestrian;
1.1) using a bicubic interpolation algorithm to scale the image to (H, W, C) as the input image, where H denotes the height of the image, W the width, and C = 3 the number of channels; the specific steps are as follows:
1.1.1) constructing the Bicubic function:

W(x) = (a+2)|x|^3 - (a+3)|x|^2 + 1, for |x| <= 1
W(x) = a|x|^3 - 5a|x|^2 + 8a|x| - 4a, for 1 < |x| < 2
W(x) = 0, otherwise (1)

where a is a coefficient that controls the shape of the Bicubic curve;

1.1.2) the interpolation formula is as follows:

f(x, y) = Σ_{i=0}^{3} Σ_{j=0}^{3} f(x_i, y_j) · W(x - x_i) · W(y - y_j) (2)

where (x, y) denotes the pixel to be interpolated; for each pixel, the 4 × 4 pixels near it are used in the bicubic interpolation operation.
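By way of non-limiting illustration, a minimal Python sketch of this scaling step follows; the use of the Pillow library and the default target size are assumptions (the embodiment scales to (224, 224, 3)), and Pillow's BICUBIC filter applies the 4 × 4-neighbourhood kernel of equations (1)-(2) with a = -0.5.

```python
# Sketch of step 1.1): bicubic scaling of an input image.
# Pillow and the default target size are assumptions, not part of the patent text.
from PIL import Image

def resize_bicubic(path: str, size=(224, 224)) -> Image.Image:
    img = Image.open(path).convert("RGB")  # force C = 3 channels
    # Pillow's BICUBIC filter uses the standard 4x4 kernel of eqs. (1)-(2).
    return img.resize(size, Image.BICUBIC)
```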
1.2) performing data augmentation with a random erasing algorithm;
1.2.1) setting a threshold probability p and generating a random number p1 in [0, 1]; when p1 > p, the input image is left unprocessed; otherwise, the input image is erased:
p1=Rand(0,1) (3)
1.2.2) determining an erasing area;
He=Rand(H/8,H/4) (4)
We=Rand(W/8,W/4) (5)
Se=He×We (6)
where H denotes the height of the input image and W its width; H_e denotes the height of the erased region, W_e its width, and S_e its area;
1.2.3) determining an erasing coordinate;
xe=Rand(0,H-He) (7)
ye=Rand(0,W-We) (8)
where x_e denotes the x (row) coordinate of the top-left corner of the erased region and y_e its y (column) coordinate.
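A minimal sketch of this random-erasing step is given below, assuming the image is a NumPy array of shape (H, W, C); filling the erased region with 0 is an assumption, since the patent does not state the fill value.

```python
import random
import numpy as np

def random_erase(img: np.ndarray, p: float = 0.5) -> np.ndarray:
    H, W = img.shape[:2]
    p1 = random.random()                    # eq. (3): p1 = Rand(0, 1)
    if p1 > p:                              # leave the input image unprocessed
        return img
    He = random.randint(H // 8, H // 4)     # eq. (4): erase height H_e
    We = random.randint(W // 8, W // 4)     # eq. (5): erase width  W_e
    xe = random.randint(0, H - He)          # eq. (7): top-left row x_e
    ye = random.randint(0, W - We)          # eq. (8): top-left column y_e
    out = img.copy()
    out[xe:xe + He, ye:ye + We] = 0         # erased area S_e = H_e * W_e, eq. (6)
    return out
```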
Step two, inputting the preprocessed image into the hierarchical self-attention network, namely a Swin Transformer neural network, and performing forward propagation;
the backbone network comprises 4 processing stages, wherein the 2-4 stages have the same network structure, and the specific steps are as follows:
2.1) stage 1;
2.1.1) block division; starting from the top-left corner of the image, the input image is divided into a set of non-overlapping 4 × 4 image patches, i.e., patches of shape (4, 4, 3), where the number of patches N_patch is:
Npatch=(H/4)×(W/4) (9)
2.1.2) linear embedding; each image patch is flattened and projected to a C-dimensional vector through a fully connected layer, and the resulting tokens are fed into two consecutive Swin blocks;
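The following PyTorch sketch illustrates 2.1.1)-2.1.2); the stride-4 convolution is a standard, mathematically equivalent realisation of "partition into 4 × 4 patches, flatten, fully connected layer", and C = 128 is taken from the embodiment.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of block division (2.1.1) and linear embedding (2.1.2)."""
    def __init__(self, patch: int = 4, in_ch: int = 3, dim: int = 128):
        super().__init__()
        # Equivalent to splitting into non-overlapping patch x patch blocks
        # and projecting each flattened block with a linear layer.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, H, W)
        x = self.proj(x)                                  # (B, dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)               # (B, N_patch, dim), eq. (9)
```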
2.1.3) extracting features with Swin blocks;
The Swin blocks comprise Swin block 1 and Swin block 2. Swin block 1 consists mainly of a window-based multi-head self-attention module and a multilayer perceptron (MLP); each of the two modules is preceded by layer normalization and followed by a residual connection. Swin block 2 consists mainly of a shifted-window multi-head self-attention module and an MLP, likewise with layer normalization before and a residual connection after each module;
After extraction by the Swin blocks, key feature information about the pedestrian's head, hands, actions, and the like is obtained, and a feature map of shape (H/4, W/4, C) is output;
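A simplified PyTorch sketch of the Swin block follows; for brevity it omits the cyclic shift, attention mask, and relative position bias of the full shifted-window design, keeping only the pre-normalized windowed self-attention and MLP structure with residual connections described above.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Simplified Swin block (2.1.3): pre-norm windowed self-attention and MLP,
    each with a residual connection; shift masking and relative position bias
    of the real Swin Transformer are omitted."""
    def __init__(self, dim: int = 128, heads: int = 4, window: int = 7):
        super().__init__()
        self.w = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, Hp: int, Wp: int) -> torch.Tensor:
        B, N, D = x.shape
        w = self.w
        # partition the (Hp, Wp) token grid into non-overlapping w x w windows
        win = x.view(B, Hp // w, w, Wp // w, w, D)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, D)
        h = self.norm1(win)
        win = win + self.attn(h, h, h, need_weights=False)[0]  # W-MSA + residual
        win = win + self.mlp(self.norm2(win))                  # MLP + residual
        # reverse the window partition back to (B, N, D)
        win = win.view(B, Hp // w, Wp // w, w, w, D)
        return win.permute(0, 1, 3, 2, 4, 5).reshape(B, N, D)
```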
2.2) stage 2;
2.2.1) block fusion; the input features are merged in neighbouring 2 × 2 groups, a fully connected layer adjusts the feature dimension to twice the original, and a feature map of shape (H/8, W/8, 2C) is output;
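A sketch of this block-fusion (patch-merging) step, assuming the tokens lie on an (Hp, Wp) grid; the layer normalization before the linear reduction follows the published Swin Transformer design.

```python
import torch
import torch.nn as nn

class PatchMergingSketch(nn.Module):
    """Sketch of 2.2.1): concatenate each 2x2 group of neighbouring tokens
    (giving 4C channels) and project to 2C, halving height and width."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor, Hp: int, Wp: int) -> torch.Tensor:
        B, N, D = x.shape
        g = x.view(B, Hp, Wp, D)
        g = torch.cat([g[:, 0::2, 0::2], g[:, 1::2, 0::2],
                       g[:, 0::2, 1::2], g[:, 1::2, 1::2]], dim=-1)
        g = g.view(B, (Hp // 2) * (Wp // 2), 4 * D)
        return self.reduction(self.norm(g))  # (B, N/4, 2C)
```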
2.2.2) extracting features with Swin blocks, identical in structure to the Swin blocks in 2.1.3); after Swin-block processing, a key feature map of shape (H/8, W/8, 2C) is output;
2.3) stages 3-4;
The network structures of stage 3 and stage 4 are identical to that of stage 2; after processing, they output feature maps of shape (H/16, W/16, 4C) and (H/32, W/32, 8C), respectively;
2.4) global average pooling layer and fully connected layer; global average pooling is applied to the feature map output by stage 4 to obtain a vector of length 8C, and the features are mapped to N classes through a fully connected layer, where N is the number of pedestrian classes in the data set of step one.
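A sketch of the pooling-plus-classifier head described in 2.4); the defaults 8C = 1024 and N = 751 are the embodiment's values and serve here only as placeholders.

```python
import torch
import torch.nn as nn

class ReIDHead(nn.Module):
    """Sketch of 2.4): global average pooling over the stage-4 tokens,
    then a fully connected layer mapping to the N identity classes."""
    def __init__(self, dim: int = 1024, num_ids: int = 751):
        super().__init__()
        self.fc = nn.Linear(dim, num_ids)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, tokens, 8C)
        return self.fc(x.mean(dim=1))                     # (B, N) identity logits
```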
Step three, calculating the loss function, back-propagating, and updating the network parameters;
3.1) the loss function consists of two parts, ID loss and Circle loss, and the formula is as follows:
Lreid=w1Lid+w2Lcircle (10)
where w_1 and w_2 denote the weights of the ID loss and the Circle loss, respectively; L_reid denotes the total loss function, L_id the ID loss, and L_circle the Circle loss;
3.2) the ID loss is the cross-entropy loss:

L_id = -(1/n) Σ_{i=1}^{n} log p(y_i|x_i) (11)

where n denotes the number of samples in each training batch, and p(y_i|x_i) denotes the conditional probability that input image x_i is assigned its label y_i;
3.3) the Circle loss formula is as follows:

L_circle = log[1 + Σ_j exp(γ·a_n^j·(S_n^j - Δ_n)) · Σ_i exp(-γ·a_p^i·(S_p^i - Δ_p))] (12)

Δ_n = m (13)

Δ_p = 1 - m (14)

where N denotes the number of different pedestrian classes and M_i the number of images in the class of the i-th pedestrian; γ is a scale parameter; m controls the strictness of the optimization; S_n is the inter-class similarity score matrix and S_p the intra-class similarity score matrix; a_n and a_p are non-negative weighting matrices for S_n and S_p, respectively:

a_n = [S_n + m]_+ (15)

a_p = [1 + m - S_p]_+ (16)

where [·]_+ means negative values are clipped to zero;
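The sketch below combines the two losses per equation (10); the Circle-loss body follows the published formulation of Sun et al. (CVPR 2020) that equations (12)-(16) describe. Treating S_p and S_n as 1-D score vectors for a single anchor, and the default γ, m, w_1, w_2 values, are assumptions based on the embodiment.

```python
import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor,
                gamma: float = 32.0, m: float = 0.25) -> torch.Tensor:
    """Circle loss for one anchor; sp/sn are 1-D intra-/inter-class scores."""
    ap = torch.clamp_min(1 + m - sp, 0.0)      # eq. (16): a_p = [1 + m - S_p]_+
    an = torch.clamp_min(sn + m, 0.0)          # eq. (15): a_n = [S_n + m]_+
    logit_p = -gamma * ap * (sp - (1 - m))     # Delta_p = 1 - m, eq. (14)
    logit_n = gamma * an * (sn - m)            # Delta_n = m,     eq. (13)
    # eq. (12): log(1 + sum exp(...) * sum exp(...)) via softplus + logsumexp
    return F.softplus(torch.logsumexp(logit_n, 0) + torch.logsumexp(logit_p, 0))

def reid_loss(logits, labels, sp, sn, w1: float = 0.4, w2: float = 0.6):
    """Eq. (10): L_reid = w1 * L_id + w2 * L_circle, with L_id the
    cross-entropy ID loss of eq. (11)."""
    return w1 * F.cross_entropy(logits, labels) + w2 * circle_loss(sp, sn)
```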
3.4) setting hyper-parameters and training the network; a warm-up learning rate is adopted, starting at r and gradually increasing to ten times r over the first 10 training epochs; the optimizer is a stochastic gradient descent algorithm augmented with weight decay d_1 and momentum d_2; back propagation is performed with the configured optimizer and learning rate, combined with the loss values calculated in 3.1)-3.3), and the network parameters are updated.
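A sketch of the training configuration in 3.4); the concrete values of r, d_1, and d_2 below are placeholders, since the patent leaves them symbolic, and holding the learning rate at 10r after warm-up is an assumption.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 751)      # stand-in for the Swin backbone plus head
r, d1, d2 = 1e-4, 5e-4, 0.9       # placeholder values for r, weight decay, momentum

optimizer = torch.optim.SGD(model.parameters(), lr=r,
                            weight_decay=d1, momentum=d2)
# Warm-up: scale the learning rate linearly from r to 10r over the first
# 10 epochs, then keep it at 10r.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(epoch + 1, 10))
```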
Step four, carrying out pedestrian re-identification matching;
The image of the pedestrian to be identified is scaled and input into the Swin Transformer neural network of step two; the output is processed with softmax to obtain N probability values, each corresponding to the probability that the pedestrian belongs to one class, and the class with the largest probability value is the identity of the pedestrian.
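A sketch of the matching step, assuming a trained `model` that maps a resized image tensor to the N identity logits:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def identify(model: nn.Module, img: torch.Tensor) -> int:
    """Step four: softmax over the N class logits; the argmax is the identity."""
    probs = torch.softmax(model(img.unsqueeze(0)), dim=1)  # (1, N)
    return int(probs.argmax(dim=1))                        # predicted pedestrian ID
```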
The beneficial effects of the invention are as follows: the invention provides a pedestrian re-identification baseline method based on a hierarchical self-attention network, which creatively introduces the Swin Transformer as a backbone network into the field of pedestrian re-identification, uses a weighted sum of ID loss and Circle loss as the loss function, and greatly improves the training effect while keeping the structure simple through effective data preprocessing and reasonable parameter tuning.
Drawings
FIG. 1 is a general improved concept diagram of the present invention;
FIG. 2 is a model diagram of a baseline pedestrian re-identification method based on a hierarchical self-attention network according to the present invention;
fig. 3 is a schematic diagram of the Swin block structure.
Detailed Description
The embodiments of the invention are described in detail below with reference to the accompanying drawings, giving a detailed implementation and concrete operating procedure. The data set of the experiment is the Market1501 data set, collected at a university; the training set comprises 751 identities and 12936 images, and the test set comprises 750 identities and 19732 images.
Fig. 1 is a general improvement idea diagram of the present invention, and fig. 2 is a model diagram of a pedestrian re-identification baseline method based on a hierarchical self-attention network according to the present invention, which includes the following specific steps:
step one, preprocessing data;
There are 751 identities in the training set, so N = 751; each pedestrian i has M_i images, where M_i > 1, M_i denoting the number of images in the class of the i-th pedestrian and i the ID number of each pedestrian. For the i-th pedestrian, M_i - 1 images are used as the training set and 1 image as the validation set, with i used as the label indicating that the image corresponds to the i-th pedestrian;
1.1) scaling the image to (224, 224, 3) using a bicubic interpolation algorithm, where H = 224 denotes the height of the image, W = 224 the width, and C = 3 the number of channels; the specific steps are as follows:
1.1.1) constructing the Bicubic function of equation (1), with a = -0.5, the coefficient that controls the shape of the Bicubic curve;
1.1.2) interpolating according to equation (2): for each pixel (x, y) to be interpolated, the 4 × 4 pixels near it are used in the bicubic interpolation operation.
1.2) performing data augmentation with a random erasing algorithm;
1.2.1) setting the threshold probability p to 0.5 and generating a random number p1 in [0, 1]; when p1 > p, the image is left unprocessed; otherwise, the image is erased:
p1=Rand(0,1) (3)
1.2.2) determining an erasing area;
He=Rand(H/8,H/4) (4)
We=Rand(W/8,W/4) (5)
Se=He×We (6)
where H denotes the height of the input image and W its width; H_e denotes the height of the erased region, W_e its width, and S_e its area;
1.2.3) determining an erasing coordinate;
xe=Rand(0,H-He) (7)
ye=Rand(0,W-We) (8)
where x_e denotes the x (row) coordinate of the top-left corner of the erased region and y_e its y (column) coordinate.
Step two, inputting the preprocessed image into the hierarchical self-attention network, namely a Swin Transformer neural network, and performing forward propagation;
the backbone network comprises 4 processing stages, wherein the 2-4 stages have the same network structure, and the specific steps are as follows:
2.1) stage 1;
2.1.1) block division; starting from the top-left corner of the image, the input image is divided into a set of non-overlapping 4 × 4 image patches, i.e., patches of shape (4, 4, 3), where the number of patches N_patch is:
Npatch=(H/4)×(W/4) (9)
where H and W denote the height and width of the input image, respectively; here N_patch = 56 × 56;
2.1.2) linear embedding; each image patch is flattened and projected to a 128-dimensional vector through a fully connected layer, and the resulting tokens are fed into two consecutive Swin blocks;
2.1.3) extracting features with Swin blocks;
As shown in fig. 3, the Swin blocks comprise Swin block 1 and Swin block 2. Swin block 1 consists mainly of a window-based multi-head self-attention module and a multilayer perceptron (MLP); each of the two modules is preceded by layer normalization and followed by a residual connection. Swin block 2 consists mainly of a shifted-window multi-head self-attention module and an MLP, likewise with layer normalization before and a residual connection after each module;
After extraction by the Swin blocks, key feature information about the pedestrian's head, hands, actions, and the like is obtained; a feature map of shape (56, 56, 128) is output and passed to the next module;
2.2) stage 2;
2.2.1) block fusion; the input features are merged in neighbouring 2 × 2 groups, a fully connected layer adjusts the feature dimension to twice the original, and a feature map of shape (28, 28, 256) is output;
2.2.2) extracting features with Swin blocks, identical in structure to 2.1.3); after Swin-block processing, a feature map of shape (28, 28, 256) is output;
2.3) stages 3-4;
The structures of stage 3 and stage 4 are identical to that of stage 2; after processing, they output feature maps of shape (14, 14, 512) and (7, 7, 1024), respectively;
2.4) global average pooling layer and fully connected layer; global average pooling is applied to the feature map output by stage 4 to obtain a vector of length 1024, and the features are mapped to 751 classes through a fully connected layer, 751 being the number of pedestrian classes in the data set used in this embodiment.
Step three, calculating the loss function, back-propagating, and updating the network parameters;
3.1) the loss function consists of two parts, ID loss and Circle loss, and the formula is as follows:
Lreid=w1Lid+w2Lcircle (10)
where w_1 and w_2 denote the weights of the ID loss and the Circle loss, with w_1 = 0.4 and w_2 = 0.6; L_reid denotes the total loss function, L_id the ID loss, and L_circle the Circle loss;
3.2) the ID loss follows equation (11), where n denotes the number of samples in each training batch, 16 in this embodiment, and p(y_i|x_i) denotes the conditional probability that input image x_i is assigned its label y_i;
3.3) the Circle loss follows equations (12)-(16), with

Δ_n = m (13)

Δ_p = 1 - m (14)

where N denotes the number of different pedestrian classes, 751 in this embodiment; M_i denotes the number of images in the class of the i-th pedestrian; γ is the scale parameter, 32 in this embodiment; m controls the strictness of the optimization, 0.25 in this embodiment; S_n is the inter-class similarity score matrix, S_p the intra-class similarity score matrix, and a_n and a_p the corresponding non-negative weighting matrices;
3.4) the hyper-parameters used for training the neural network are shown in Table 1; back propagation is performed with the configured optimizer and learning rate, combined with the loss values calculated in 3.1)-3.3), to update the network parameters.
TABLE 1 hyper-parameter settings for training networks
Step four, carrying out pedestrian re-identification matching;
The image of the pedestrian to be identified is scaled and input into the Swin Transformer neural network of step two; the output is processed with softmax to obtain 751 probability values, each corresponding to the probability that the pedestrian belongs to one class, and the class with the largest probability value is the identity of the pedestrian.
In this embodiment, the pedestrian re-identification effect is tested on the Market1501 data set and compared with existing pedestrian re-identification baseline models based on global features, as shown in Table 2:
table 2 comparison of results with existing baseline model
The comparison of experimental results shows that the proposed baseline model effectively improves the Rank-1 and mAP metrics of pedestrian re-identification, which proves the effectiveness of the method and is of great significance for the practical application of pedestrian re-identification; in addition, the network structure is simple and highly extensible, providing a valuable reference for the design of future pedestrian re-identification methods.
Claims (1)
1. A pedestrian re-identification baseline method based on a hierarchical self-attention network is characterized by comprising the following steps:
step one, preprocessing data;
setting a total of N different pedestrians, each pedestrian i having M_i images, where M_i > 1, M_i denoting the number of images in the class of the i-th pedestrian and i denoting the ID number of each pedestrian; for the i-th pedestrian, M_i - 1 images are used as the training set and 1 image as the validation set, with i used as the label indicating that the image corresponds to the i-th pedestrian;
1.1) using a bicubic interpolation algorithm to scale the image to (H, W, C) as the input image, where H denotes the height of the image, W the width, and C = 3 the number of channels; the specific steps are as follows:
1.1.1) constructing the Bicubic function:

W(x) = (a+2)|x|^3 - (a+3)|x|^2 + 1, for |x| <= 1
W(x) = a|x|^3 - 5a|x|^2 + 8a|x| - 4a, for 1 < |x| < 2
W(x) = 0, otherwise (1)

where a is a coefficient that controls the shape of the Bicubic curve;

1.1.2) the interpolation formula is as follows:

f(x, y) = Σ_{i=0}^{3} Σ_{j=0}^{3} f(x_i, y_j) · W(x - x_i) · W(y - y_j) (2)

where (x, y) denotes the pixel to be interpolated; for each pixel, the 4 × 4 pixels near it are used in the bicubic interpolation operation;
1.2) performing data augmentation with a random erasing algorithm;
1.2.1) setting a threshold probability p and generating a random number p1 in [0, 1]; when p1 > p, the input image is left unprocessed; otherwise, the input image is erased:
p1=Rand(0,1) (3)
1.2.2) determining an erasing area;
He=Rand(H/8,H/4) (4)
We=Rand(W/8,W/4) (5)
Se=He×We (6)
where H denotes the height of the input image and W its width; H_e denotes the height of the erased region, W_e its width, and S_e its area;
1.2.3) determining an erasing coordinate;
xe=Rand(0,H-He) (7)
ye=Rand(0,W-We) (8)
where x_e denotes the x (row) coordinate of the top-left corner of the erased region and y_e its y (column) coordinate;
step two, inputting the preprocessed image into the hierarchical self-attention network, namely a Swin Transformer neural network, and performing forward propagation;
the backbone network comprises 4 processing stages, where stages 2-4 share the same network structure; the specific steps are as follows:
2.1) stage 1;
2.1.1) block division; starting from the top-left corner of the image, the input image is divided into a set of non-overlapping 4 × 4 image patches, i.e., patches of shape (4, 4, 3), where the number of patches N_patch is:
Npatch=(H/4)×(W/4) (9)
2.1.2) linear embedding; each image patch is flattened and projected to a C-dimensional vector through a fully connected layer, and the resulting tokens are fed into two consecutive Swin blocks;
2.1.3) extracting features with Swin blocks;
as shown in fig. 3, the Swin blocks comprise Swin block 1 and Swin block 2; Swin block 1 consists mainly of a window-based multi-head self-attention module and a multilayer perceptron (MLP), each of the two modules being preceded by layer normalization and followed by a residual connection; Swin block 2 consists mainly of a shifted-window multi-head self-attention module and an MLP, likewise with layer normalization before and a residual connection after each module;
after extraction by the Swin blocks, key feature information about the pedestrian's head, hands, and actions is obtained, and a feature map of shape (H/4, W/4, C) is output;
2.2) stage 2;
2.2.1) block fusion; the input features are merged in neighbouring 2 × 2 groups, a fully connected layer adjusts the feature dimension to twice the original, and a feature map of shape (H/8, W/8, 2C) is output;
2.2.2) extracting features with Swin blocks, identical in structure to the Swin blocks in 2.1.3); after Swin-block processing, a key feature map of shape (H/8, W/8, 2C) is output;
2.3) stages 3-4;
the network structures of stage 3 and stage 4 are identical to that of stage 2; after processing, they output feature maps of shape (H/16, W/16, 4C) and (H/32, W/32, 8C), respectively;
2.4) global average pooling layer and fully connected layer; global average pooling is applied to the feature map output by stage 4 to obtain a vector of length 8C, and the features are mapped to N classes through a fully connected layer, where N is the number of pedestrian classes in the data set of step one;
step three, calculating the loss function, back-propagating, and updating the network parameters;
3.1) the loss function consists of two parts, ID loss and Circle loss, and the formula is as follows:
Lreid=w1Lid+w2Lcircle (10)
where w_1 and w_2 denote the weights of the ID loss and the Circle loss, respectively; L_reid denotes the total loss function, L_id the ID loss, and L_circle the Circle loss;
3.2) the ID loss is the cross-entropy loss:

L_id = -(1/n) Σ_{i=1}^{n} log p(y_i|x_i) (11)

where n denotes the number of samples in each training batch, and p(y_i|x_i) denotes the conditional probability that input image x_i is assigned its label y_i;
3.3) the Circle loss formula is as follows:

L_circle = log[1 + Σ_j exp(γ·a_n^j·(S_n^j - Δ_n)) · Σ_i exp(-γ·a_p^i·(S_p^i - Δ_p))] (12)

Δ_n = m (13)

Δ_p = 1 - m (14)

where N denotes the number of different pedestrian classes and M_i the number of images in the class of the i-th pedestrian; γ is a scale parameter; m controls the strictness of the optimization; S_n is the inter-class similarity score matrix and S_p the intra-class similarity score matrix; a_n and a_p are non-negative weighting matrices for S_n and S_p, respectively:

a_n = [S_n + m]_+ (15)

a_p = [1 + m - S_p]_+ (16)

where [·]_+ means negative values are clipped to zero;
3.4) setting hyper-parameters and training the network; a warm-up learning rate is adopted, starting at r and gradually increasing to ten times r over the first 10 training epochs; the optimizer is a stochastic gradient descent algorithm augmented with weight decay d_1 and momentum d_2; back propagation is performed with the configured optimizer and learning rate, combined with the loss values calculated in 3.1)-3.3), and the network parameters are updated;
step four, carrying out pedestrian re-identification matching;
scaling the image of the pedestrian to be identified, inputting it into the Swin Transformer neural network of step two, and processing the output with softmax to obtain N probability values, each corresponding to the probability that the pedestrian belongs to one class; the class with the largest probability value is the identity of the pedestrian.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111087471.7A CN113792669B (en) | 2021-09-16 | 2021-09-16 | Pedestrian re-recognition baseline method based on hierarchical self-attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113792669A true CN113792669A (en) | 2021-12-14 |
CN113792669B CN113792669B (en) | 2024-06-14 |
Family
ID=78878614
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111087471.7A Active CN113792669B (en) | 2021-09-16 | 2021-09-16 | Pedestrian re-recognition baseline method based on hierarchical self-attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113792669B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114842085A (en) * | 2022-07-05 | 2022-08-02 | 松立控股集团股份有限公司 | Full-scene vehicle attitude estimation method |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108710831A (en) * | 2018-04-24 | 2018-10-26 | 华南理工大学 | A kind of small data set face recognition algorithms based on machine vision |
US20210201010A1 (en) * | 2019-12-31 | 2021-07-01 | Wuhan University | Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof |
CN112183468A (en) * | 2020-10-27 | 2021-01-05 | 南京信息工程大学 | Pedestrian re-identification method based on multi-attention combined multi-level features |
CN112818790A (en) * | 2021-01-25 | 2021-05-18 | 浙江理工大学 | Pedestrian re-identification method based on attention mechanism and space geometric constraint |
Non-Patent Citations (1)
Title |
---|
LIU Ziyan; WAN Peipei: "Pedestrian re-identification feature extraction method based on attention mechanism", Journal of Computer Applications, no. 03, 31 December 2020 (2020-12-31) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114842085A (en) * | 2022-07-05 | 2022-08-02 | 松立控股集团股份有限公司 | Full-scene vehicle attitude estimation method |
CN114842085B (en) * | 2022-07-05 | 2022-09-16 | 松立控股集团股份有限公司 | Full-scene vehicle attitude estimation method |
Also Published As
Publication number | Publication date |
---|---|
CN113792669B (en) | 2024-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977918B (en) | Target detection positioning optimization method based on unsupervised domain adaptation | |
US20230186056A1 (en) | Grabbing detection method based on rp-resnet | |
CN109492529A (en) | A kind of Multi resolution feature extraction and the facial expression recognizing method of global characteristics fusion | |
CN108427921A (en) | A kind of face identification method based on convolutional neural networks | |
CN108509839A (en) | One kind being based on the efficient gestures detection recognition methods of region convolutional neural networks | |
CN109359603A (en) | A kind of vehicle driver's method for detecting human face based on concatenated convolutional neural network | |
CN104361316B (en) | Dimension emotion recognition method based on multi-scale time sequence modeling | |
CN109325440B (en) | Human body action recognition method and system | |
CN110956082B (en) | Face key point detection method and detection system based on deep learning | |
CN110532925B (en) | Driver fatigue detection method based on space-time graph convolutional network | |
CN110245620B (en) | Non-maximization inhibition method based on attention | |
CN105243154A (en) | Remote sensing image retrieval method and system based on significant point characteristics and spare self-encodings | |
CN112164077B (en) | Cell instance segmentation method based on bottom-up path enhancement | |
Kaluri et al. | A framework for sign gesture recognition using improved genetic algorithm and adaptive filter | |
CN109101108A (en) | Method and system based on three decision optimization intelligence cockpit human-computer interaction interfaces | |
CN117058437B (en) | Flower classification method, system, equipment and medium based on knowledge distillation | |
CN113255557A (en) | Video crowd emotion analysis method and system based on deep learning | |
CN110599502A (en) | Skin lesion segmentation method based on deep learning | |
US20190266443A1 (en) | Text image processing using stroke-aware max-min pooling for ocr system employing artificial neural network | |
CN116258874A (en) | SAR recognition database sample gesture expansion method based on depth condition diffusion network | |
CN114742224A (en) | Pedestrian re-identification method and device, computer equipment and storage medium | |
CN114821736A (en) | Multi-modal face recognition method, device, equipment and medium based on contrast learning | |
CN116524189A (en) | High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization | |
CN113792669A (en) | Pedestrian re-identification baseline method based on hierarchical self-attention network | |
CN118522039B (en) | Frame extraction pedestrian retrieval method based on YOLOv s and stage type regular combined pedestrian re-recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |