
CN108520226B - Pedestrian re-identification method based on body decomposition and significance detection - Google Patents


Info

Publication number
CN108520226B
Authority
CN
China
Prior art keywords
picture
image
blocks
feature
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810288204.8A
Other languages
Chinese (zh)
Other versions
CN108520226A (en)
Inventor
张云洲
刘一秀
王松
史维东
孙立波
刘双伟
李瑞龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201810288204.8A priority Critical patent/CN108520226B/en
Publication of CN108520226A publication Critical patent/CN108520226A/en
Application granted granted Critical
Publication of CN108520226B publication Critical patent/CN108520226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on body decomposition and saliency detection. A pedestrian image is first parsed into semantic regions with a Deep Decomposition Network (DDN), and the pedestrian is separated from the cluttered environment by sliding-window and color matching; the pedestrian image is then divided into small blocks, and effective picture regions are selected automatically according to the background subtraction result and the salient regions.

Description

Pedestrian re-identification method based on body decomposition and significance detection
Technical Field
The invention belongs to the field of image processing, and particularly relates to a pedestrian re-identification method based on body decomposition and significance detection.
Background
The purpose of pedestrian re-identification is to identify all images of a person captured by different cameras. It is an important aspect of the intelligent video research field and a rapidly developing, yet still maturing, topic in computer vision. Under the influence of different viewing angles, illumination and scales between shots, features such as the posture, color and outline of a pedestrian image differ greatly, and how to improve the re-identification rate of pedestrian images remains a great challenge.
Therefore, how to increase the re-recognition rate of the pedestrian image becomes a technical problem to be solved at present.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a pedestrian re-identification method based on body decomposition and significance detection, which can improve the re-identification rate of pedestrian images in different cameras.
In a first aspect, the present invention provides a pedestrian re-identification method based on body decomposition and saliency detection, including:
S1, for a pedestrian picture to be processed from a first camera, dividing the picture into a plurality of picture blocks and decomposing the picture into semantic regions with a Deep Decomposition Network (DDN); that is, the DDN renders each part of the human body in the picture in a different color, and a sliding window with color matching selects the picture blocks with a high overlap rate, so that the pedestrian can be separated from the environment;
S2, processing the plurality of picture blocks by sliding-window and color matching based on the DDN semantic regions, and obtaining the set S1 of reserved picture blocks of the picture;
S3, automatically detecting salient regions in the picture using graph-based visual saliency (GBVS); that is, to reduce the huge cost of matching time, this embodiment selects the 25 blocks with relatively high saliency scores, an empirical trade-off between time consumption and matching accuracy;
S4, matching each picture block in the set S1 against the salient regions to obtain a saliency value for each picture block in the set S1;
S5, based on the picture blocks in the set S1 and their saliency values, acquiring the corresponding image blocks of each pedestrian image to be processed in a second camera and the saliency value of each such image block;
the number of these image blocks is consistent with the number of picture blocks in the filtered set S1;
S6, extracting feature vectors representing all of these image blocks and feature vectors representing all picture blocks in the set S1;
and S7, fusing the feature vectors of the image blocks and the feature vectors of the picture blocks based on metric learning, and obtaining the recognition result of the current image in the second camera.
Optionally, the method further comprises:
repeating the steps S5 to S7, and obtaining the identification results of all the images in the second camera.
Optionally, the step S2 includes:
determining whether a picture block should be masked by computing the overlap rate between the sliding mask window and the colors of the pedestrian body parts in the semantic regions segmented by the DDN, wherein each image is divided into non-overlapping picture blocks of size 10 × 10;
[equation (1): the overlap rate c(P_ij) between the sliding mask M and picture block P_ij, expressed with the non-zero-element count u(·)]
wherein P_ij denotes the picture block in the i-th row and j-th column, i, j ∈ N+, {i, j | i ≤ m, j ≤ n}; m is the number of blocks in the horizontal direction after the image has been divided into picture blocks by the grid, and n is the number of blocks in the vertical direction; c(P_ij) denotes the overlap rate between the sliding mask M and P_ij; u(x) denotes the number of non-zero elements in the matrix x; and x_p and y_p denote the number of picture blocks in the horizontal and vertical directions, respectively;
picture blocks with c(P_ij) < 25% are retained and the other picture blocks are masked out; the background subtraction result is the basic condition for picture block selection, and in this embodiment the reserved picture blocks of each image are defined as the set S1.
Optionally, the step S5 includes:
finding, from the image to be processed of the second camera B, the image block at the position corresponding to each picture block in the set S1, based on the position of each picture block in the set S1 of the first camera A;
defining the saliency similarity between a pair of picture blocks from different cameras as
[equation (2): the saliency similarity sim_saliency(P_{A,u}(i), P_{B,v}(j)), defined from the Euclidean distance d(·) between the saliency vectors of the two blocks with bandwidth parameter σ_d]
the salient picture blocks of the pedestrian image are denoted P_{A,u}(i), where (A, u) denotes a picture under the first camera A, i denotes the position of the decomposed picture block in the picture, s_{A,u}(i) is the saliency vector of the picture block, d(·) is the Euclidean distance, and σ_d is a bandwidth parameter; the corresponding picture block is obtained under the second camera B:
I_{B,u} = find(min(sim_saliency(P_{A,u}, P_{B,v})))   (3)
substituting equation (2) gives:
[equation (4): the result of substituting the similarity definition (2) into equation (3)]
the function find(·) returns the index of the image block of the image under the second camera B found by saliency matching with a picture block of the picture under the first camera A; the matched block is indexed by I_{B,u}, where u denotes the index value and i ∈ {1, 2, ..., 25}.
Optionally, the step S6 includes:
S61, the LOMO feature analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes;
S62, a Retinex transform and a scale-invariant texture operator are applied to handle illumination changes;
S63, to make pedestrian re-identification easier than using the original image, the method of the invention uses HSV color histograms to extract feature vectors with 8 × 8 × 8 = 512 dimensions;
S64, picture blocks are located in the 128 × 48 image using a sliding window of size 10 × 10 with an overlapping step of 5 pixels; two scales of the SILTP histogram, SILTP^{0.3}_{4,3} and SILTP^{0.3}_{4,5}, are extracted, each with 3^4 bins; a three-scale pyramid representation is built by down-sampling the original 128 × 48 image with two 2 × 2 local average pooling operations and repeating the feature vector extraction, giving a final feature vector of (8 × 8 × 8 + 3^4 × 2) × (24 + 11 + 5 horizontal groups) = 26960 dimensions;
S65, a PHOG feature vector, an HSV histogram and SIFT feature vectors are extracted from each selected picture block;
with the number of pyramid layers set to L = 3 and the number of gradient bins set to n = 8, the dimension of the PHOG feature is (1 + 4 + 16 + 64) × 8 = 680; the color histogram is an important descriptor that performs prominently in recognition tasks; to obtain the HSV histogram feature, the RGB image is first converted into an HSV image, and the dimension of the HSV histogram feature is 8 × 8 × 8 = 512;
in addition, 128-dimensional SIFT features are extracted from the selected picture blocks.
Optionally, the step S7 includes:
dist_{i,j} is defined as the distance between features x_i and x_j across different camera views;
dist_{i,j} = (x_i - x_j)^T W (x_i - x_j)   (5)
wherein w_i ≥ 0, W = diag(w) is a diagonal matrix with W_ii = w_i; W can be determined by learning; d represents the feature dimension of the feature vector; replacing W with a symmetric positive semi-definite matrix M yields the Mahalanobis distance;
dist_{i,j} = (x_i - x_j)^T M (x_i - x_j)   (6)
M represents the metric matrix obtained by metric learning; note that M is symmetric positive semi-definite; M is embedded directly into the evaluation of a nearest-neighbor classifier and is obtained by optimizing the evaluation performance; the nearest-neighbor classifier uses majority voting when making a decision: each picture block sample in the neighborhood casts 1 vote and samples outside it cast 0 votes; the probability that sample x_j contributes to the correct classification of x_i is
p_{i,j} = exp(-dist_{i,j}) / Σ_{k=1}^{l} exp(-dist_{i,k})   (7)
where l is the number of samples; from equation (7), p_{i,j} is maximal when i = j; the leave-one-out (LOO) accuracy is calculated as
p_i = Σ_{j∈Ω_i} p_{i,j}   (8)
wherein Ω_i denotes the set of indices of samples belonging to the same class as x_i; the accuracy over the entire sample set is
f(M) = Σ_{i=1}^{l} p_i = Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (9)
then, substituting equation (7) into equation (9) and setting M = PP^T gives the optimization objective of NCA
P* = argmax_P Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (10)
by solving equation (10), the metric matrix M that maximizes the accuracy of the nearest-neighbor classifier is obtained; finally, the CMC curve for person re-identification is obtained.
The invention has the following beneficial effects:
the method can improve the re-recognition rate of the pedestrian images in different cameras.
That is, the method of the present invention parses the pedestrian image into semantic regions through a Deep Decomposition Network (DDN) and then separates the pedestrian from the environment by sliding-window and color matching.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a process according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a DDN method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating test results of DDN personnel re-identification according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a process for masking a background based on sliding window and color matching according to an embodiment of the present invention;
FIG. 5 is a schematic illustration of salient region selection in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of the significance detection of the GBVS algorithm in accordance with an embodiment of the present invention;
FIG. 7(a) and FIG. 7(b) are schematic views of the CMC curves of the method of the present invention and prior-art methods on the VIPeR and PRID2011 data sets, respectively.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
Prior-art feature representation methods focus mainly on two different aspects: hand-crafted features and deep features. More and more discriminative hand-crafted features have been developed to achieve exact matching. In recent years, deep learning has been widely used and has made great breakthroughs in almost all visual fields. In particular, for recognition tasks, many Convolutional Neural Network (CNN) based methods have been proposed to extract deep features. Relevant studies have shown that deep models trained on large-scale datasets (e.g., ImageNet) are very effective.
Background subtraction is an important pre-processing step for the image recognition task: it eliminates much of the interference caused by matching different people against the same background, i.e., much of the interference of a complex environment on pedestrian matching. Salient regions are receiving increasing attention as distinguishing features in a variety of recognition tasks, such as pedestrian search, multi-pedestrian tracking and behavior analysis across different camera scenes. Furthermore, more and more local descriptors are being developed for person re-identification, rather than focusing only on global features, because local detail information has proven very useful for person re-identification. Therefore, it is necessary to combine local descriptors with global visual features to form a new feature representation method for pedestrian re-identification based on background subtraction and saliency detection. In addition, many studies have shown that picture block matching can improve the accuracy of person identification. Since the local appearance blocks of a person seen from different angles have high similarity, many methods extract features from local picture blocks and match the feature vectors under specific constraints. However, a trade-off must be made between real-time performance and accuracy, especially when matching local features of picture blocks: if too many picture blocks are selected, accuracy becomes high but speed becomes very slow. For these reasons, the selection of picture blocks is highly important. That is, if all the picture blocks of each image are matched for pedestrian re-identification, too much time is consumed; however, when a subset of the picture blocks is selected at random, some of the discriminative picture blocks may be lost, reducing the accuracy of pedestrian re-identification.
In view of the above, it is important to select reliable picture blocks and perform efficient matching. A method for selecting reliable picture blocks by background subtraction and saliency detection is presented. It is worth mentioning that the background subtraction of the present invention is implemented by sliding-window and color matching after the pedestrian image is decomposed into semantic regions with a Deep Decomposition Network (DDN): the DDN renders each part of the human body in the whole picture in a different color, and the sliding window with color matching selects the picture blocks with a high overlap rate, so that the pedestrian can be separated from the environment. The most important innovation is the use of local descriptors to make up for the deficiency of global features in the pedestrian re-identification task.
In view of these defects, the method of the invention first parses the pedestrian image into semantic regions with a Deep Decomposition Network (DDN) and separates the pedestrian from the cluttered environment by a sliding-window and color matching method, i.e., the pedestrian image is parsed into semantic regions by the DDN.
The method comprises the following specific steps:
Step one: Background subtraction
1) DDN architecture: each image is divided into a plurality of picture blocks, which are then decomposed into the semantic regions of a Deep Decomposition Network (DDN). The DDN architecture is used for accurate pedestrian parsing, combining occlusion estimation and data transformation in a unified deep network. FIG. 2 shows the architecture of the DDN, which directly maps low-level visual features to a label map of body parts. The input is a feature vector x, and the output is a set of labels y_1, ..., y_n, each representing a body part. This architecture is mainly used for pedestrian parsing and comprises a down-sampling layer, two occlusion estimation layers, two fully-connected layers and two decomposition layers. Unlike a Convolutional Neural Network (CNN), each layer of the DDN is fully connected to the next layer, so the global structure of a person can be captured and the parsing result is improved; FIG. 3 shows the decomposition result of the DDN on a pedestrian picture. Using the semantic regions of the DDN, the background environment and the different parts of the human body are rendered in different colors, so that the pedestrian can be distinguished from the environment, in preparation for masking the background below. In this work, portrait images are first parsed into semantic regions (e.g., hair, head, body, arms, and legs) using a Deep Decomposition Network (DDN), and then pedestrians are separated from the cluttered environment using sliding windows and color matching. All of this work serves as a constraint on picture block selection.
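As an illustrative sketch (not part of the original disclosure), and assuming the DDN output is available as a per-pixel body-part label map, the background mask used by the sliding-window color matching below could be derived as follows; the label ids and names are placeholders that depend on the trained network:

```python
import numpy as np

def background_mask_from_labels(label_map, body_part_labels=(1, 2, 3, 4, 5)):
    """Turn a DDN per-pixel part label map into a binary background mask
    (1 = background pixel). body_part_labels lists the ids assumed to mark
    hair, head, body, arms and legs (illustrative values only)."""
    foreground = np.isin(label_map, body_part_labels)
    return (~foreground).astype(np.uint8)
```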
2) First, each image is divided into small picture blocks by an m × n grid, and then whether a picture block needs to be masked is judged by computing the overlap rate between the sliding mask window and the colors of the pedestrian body parts obtained from the semantic regions segmented by the DDN; the whole process is shown in FIG. 4. A 10 × 10 mask window slides over each picture, and the following formula measures whether a picture block is background to be masked:
[equation (1): the overlap rate c(P_ij) between the sliding mask M and picture block P_ij, expressed with the non-zero-element count u(·)]
where P_ij denotes the picture block in the i-th row and j-th column, i, j ∈ N+, {i, j | i ≤ m, j ≤ n}; c(P_ij) denotes the overlap rate between the sliding mask M and P_ij; m is the number of blocks in the horizontal direction after the image has been divided into picture blocks by the grid, and n is the number of blocks in the vertical direction; u(x) denotes the number of non-zero elements in the matrix x; and x_p and y_p denote the number of blocks in the horizontal and vertical directions, respectively, with x_p = m and y_p = n. Picture blocks with c(P_ij) < 25% are retained and the other picture blocks are masked out. The background subtraction result is the basic condition for block selection, and the reserved color blocks of each image are defined as the set S1 in this embodiment.
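The block retention rule around equation (1) can be sketched as follows; this is an illustrative implementation assuming the DDN-derived background mask is available as a binary array, with helper and parameter names that are not from the patent:

```python
import numpy as np

def select_foreground_blocks(background_mask, block_size=10, max_overlap=0.25):
    """Keep the 10 x 10 picture blocks whose overlap rate with the background
    mask is below 25% (the set S1). background_mask is an H x W binary array
    with 1 marking background pixels, e.g. derived from the DDN semantic map."""
    h, w = background_mask.shape
    n_rows, n_cols = h // block_size, w // block_size
    s1 = []
    for i in range(n_rows):
        for j in range(n_cols):
            block = background_mask[i * block_size:(i + 1) * block_size,
                                    j * block_size:(j + 1) * block_size]
            c = np.count_nonzero(block) / block.size   # overlap rate c(P_ij)
            if c < max_overlap:                        # c(P_ij) < 25% -> keep the block
                s1.append((i, j))
    return s1
```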
Step two: picture block selection
In pedestrian pictures taken by disjoint cameras, the saliency score of a picture block is very important, with some invariance. If tiles from two images of the same person match, then the saliency scores for these tiles should also be close to each other.
1) Saliency detection: based on the focus of attention, the salient region should have the following properties:
It makes the pedestrian more prominent than other distractors.
It is reliable for searching for the same pedestrian across different camera views.
It is easier to identify the same person than with abstract features, because if a salient object appears in one camera view, it is usually also salient in another camera view. For example, in FIG. 5, person p1 carries a red bag (shown in gray in the figure) on the shoulders, p2 has a yellow bag (shown in gray), p3 holds a red umbrella (shown in gray), and p4 holds a green bag (shown in gray) in the hand. A reliable way to obtain salient regions is a saliency learning algorithm: it divides the pedestrian into different parts, merges similar pixels, and then randomly selected body parts from the segmented semantic regions are shown to human annotators, who pick the most probable label from a label list according to what they see. However, this method takes a lot of labor and time. Thus, the present embodiment employs graph-based visual saliency (GBVS) to automatically detect salient regions. Furthermore, to reduce the huge cost of matching time, the present embodiment selects only the 25 blocks with relatively high saliency scores, which is an empirical trade-off between time consumption and matching accuracy. As can be seen from FIG. 6, the salient areas detected by the GBVS algorithm agree well with the attention distribution of human subjects. In many cases, different people from different camera views have different spatial distributions, while the salient regions of the same pedestrian under different camera views distinguish that pedestrian from others. For example, the prominent area in (a1) is a backpack; (a2) contains a similar salient region, so (a2) is the correct match for (a1). In (a3), a green bag hangs from the pedestrian's arm; the boy in (a4) wears a white and green jacket; the woman in (a5) holds a piece of white paper in her hand; they are all incorrect matches for (a1). For the same reason, (b2) is the correct match for (b1), while (b3), (b4) and (b5) are incorrect matches for (b1).
2) Selecting a picture block:
there are two principles for selecting a picture block. One is the result of background subtraction. The present embodiment obtains a set of picture blocks left after background subtraction for each image S1. The other is a significance test result. In addition, a second condition for picture block selection is set forth based on the computation of the saliency map.
First, the image is put into a Gaussian pyramid and multi-scale features are extracted during down-sampling:
R(σ) = I(x, y) ⊗ G(x, y, σ)   (2)
G(x, y, σ) = (1 / (2πσ²)) · exp(-(x² + y²) / (2σ²))   (3)
where R(σ) is the initial feature map of the GBVS model, I(x, y) represents the input image, G(x, y, σ) is the Gaussian kernel of the pyramid with scale factor (bandwidth) σ, and ⊗ represents the convolution operator.
Next, an activation map is formed from the feature map and, most importantly, a Markov matrix is constructed. The scale of the feature map is assumed to be constant; in other words, the scale σ is ignored. The dissimilarity between R(i, j) and R(p, q) is then defined as:
d((i, j) ‖ (p, q)) = | log( R(i, j) / R(p, q) ) |   (4)
where R(i, j) and R(p, q) denote the feature values of the pixels at (i, j) and (p, q), respectively. By connecting every pair of nodes of the lattice R, labeled with indices (i, j) and (p, q), a fully connected directed graph G_A is obtained. The directed edge from node (i, j) to node (p, q) is assigned the weight:
w((i, j), (p, q)) = d((i, j) ‖ (p, q)) · F(i - p, j - q)   (5)
F(a, b) = exp( -(a² + b²) / (2σ²) )   (6)
where σ is a free parameter. A Markov chain is defined on the obtained directed graph G_A. After the edge weights of G_A are normalized, the stationary distribution of the Markov chain gives the probability of transitioning from one state to another, from which the saliency over the graph is estimated and a saliency map A is obtained.
Finally, the saliency map A is normalized and a directed graph G_N is constructed, introducing an edge from (i, j) to (p, q) with weight:
w_N((i, j), (p, q)) = A(p, q) · F(i - p, j - q)   (7)
where A represents the final saliency map; each element of A represents the saliency value of the pixel at that position, and A has the same size as the original image. Each image is divided into non-overlapping picture blocks of size 10 × 10, and the blocks with higher saliency values are selected.
s(p_A(i, j)) = average(p_A(i, j))   (8)
where p_A(i, j) denotes the picture block in the i-th row and j-th column of A, and s(p_A(i, j)) denotes the average saliency value of p_A(i, j). Using 0.6 as the threshold for s(p_A(i, j)), picture blocks are filtered with the condition s(p_A(i, j)) > 0.6 and culled if they do not qualify; all retained picture blocks are defined as the set S2. The final set of retained blocks is defined by the results of both background subtraction and saliency detection, i.e., S1 ∩ S2.
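A minimal sketch of this two-condition block selection, assuming a GBVS saliency map normalized to [0, 1] has already been computed; the function, threshold and variable names mirror the text but are otherwise illustrative:

```python
import numpy as np

def select_salient_blocks(saliency_map, s1, block_size=10, thresh=0.6, top_k=25):
    """Average the saliency inside each retained 10 x 10 block (equation (8)),
    keep blocks with mean saliency > 0.6 (the set S2), intersect with the
    background subtraction result S1, and keep at most the top_k most salient."""
    scores = {}
    for (i, j) in s1:                                  # only blocks that survived S1
        block = saliency_map[i * block_size:(i + 1) * block_size,
                             j * block_size:(j + 1) * block_size]
        s = float(block.mean())                        # s(p_A(i, j))
        if s > thresh:                                 # condition defining S2
            scores[(i, j)] = s
    # S1 ∩ S2, ranked by saliency and truncated to top_k blocks
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```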
Step three: feature fusion
To overcome the disadvantages of each method while taking advantage of both, this section fuses global features with local descriptors to allow clear separation between different pedestrians.
1) Saliency matching: the saliency value s(p_B(i, j)) of each picture block in the set is calculated. The feature extraction processes are not parallel, because the spatial distributions of the selected blocks do not correspond exactly: even for the same pedestrian there is a spatial offset between the image pair. Feature extraction for the image under camera B must therefore be based on the prior salient spatial distribution of the image under camera A. To solve the problem of inconsistent spatial distribution, a saliency picture block matching method with distance tolerance is proposed; the embodiment of the invention adopts GBVS saliency detection, and the result of saliency detection on pedestrian pictures is shown in FIG. 6. To ensure that the local descriptor feature dimensions extracted from each image are the same, the 25 picture blocks with higher saliency scores are selected from the set S under camera A. Using the prior spatial distribution of saliency of these blocks, the 25 blocks corresponding to them are then found in each picture under camera B with a nearest-neighbor classifier on saliency.
Now, the saliency similarity between a pair of picture blocks from different images is defined as
[equation (9): the saliency similarity sim_saliency(P_{A,u}(i), P_{B,v}(j)), defined from the Euclidean distance d(·) between the saliency vectors of the two blocks with bandwidth parameter σ_d]
The salient picture blocks of the pedestrian image are denoted P_{A,u}(i), where (A, u) represents the view under camera A, i represents the position of the block in the image, and s_{A,u}(i) is the saliency vector of the picture block; in addition, d(·) is the Euclidean distance and σ_d is a bandwidth parameter. Finally, the corresponding picture block is obtained under camera B:
I_{B,u} = find(min(sim_saliency(P_{A,u}, P_{B,v})))   (10)
Substituting the saliency similarity definition (9) for a pair of picture blocks from different images gives:
[equation (11): the result of substituting the similarity definition (9) into equation (10)]
The function find(·) returns the index of the picture block of the image under camera B found by saliency matching with a picture block of the image under camera A; the matched block is indexed by I_{B,u}, where u denotes the index and i ∈ {1, 2, ..., 25}, as described above.
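The find(min(·)) correspondence of equation (10) amounts to a nearest-neighbor search on saliency vectors; a minimal sketch under the assumption that per-block saliency vectors have already been extracted (array and function names are illustrative):

```python
import numpy as np

def match_blocks_by_saliency(sal_vecs_a, sal_vecs_b):
    """For each of the 25 blocks selected under camera A, return the index of
    the camera-B block whose saliency vector is nearest in Euclidean distance.
    sal_vecs_a: (25, d) array; sal_vecs_b: (K, d) array of candidate blocks."""
    a = np.asarray(sal_vecs_a, dtype=float)
    b = np.asarray(sal_vecs_b, dtype=float)
    # pairwise Euclidean distances d(s_Au(i), s_Bv(j))
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return dists.argmin(axis=1)        # matched indices I_{B,u}, u = 1..25
```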
2) Feature extraction: after the designated picture block indices are obtained, features are extracted from these blocks. Feature extraction consists of two aspects: global feature extraction and local feature extraction.
The LOMO feature analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes. Furthermore, to handle illumination changes, a Retinex transform and a scale-invariant texture operator are applied. To make pedestrian re-identification easier than using the original image, the method of the invention applies HSV color histograms to extract features with 8 × 8 × 8 = 512 dimensions. In addition to the color description, the scale-invariant local ternary pattern (SILTP) descriptor is applied as an illumination-invariant texture description; SILTP is a well-known extension of the local binary pattern (LBP). Using a sub-window of size 10 × 10 with a 5-pixel overlapping step to locate local picture blocks in the 128 × 48 image, two scales of the SILTP histogram, SILTP^{0.3}_{4,3} and SILTP^{0.3}_{4,5}, are extracted, each with 3^4 bins. A three-scale pyramid representation is built by down-sampling the original 128 × 48 image with two 2 × 2 local average pooling operations and repeating the feature extraction, so the final feature has (8 × 8 × 8 color bins + 3^4 × 2 SILTP bins) × (24 + 11 + 5 horizontal groups) = 26960 dimensions.
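As an illustration of the LOMO idea of horizontal maximal occurrence, the following sketch computes only the HSV color part over the three pyramid scales (the SILTP part follows the same sliding-window / max-pooling pattern); it assumes a 128 × 48 BGR input, uses OpenCV and NumPy, and is not the reference implementation:

```python
import cv2
import numpy as np

def lomo_color_part(bgr_image):
    """HSV color part of a LOMO-style descriptor for a 128 x 48 image:
    10 x 10 sub-windows with a 5-pixel step form horizontal strips, an
    8 x 8 x 8 joint HSV histogram is taken per sub-window, and the bin-wise
    maximum over each strip gives a representation stable to viewpoint change."""
    feats, img = [], bgr_image
    for _ in range(3):                                    # three pyramid scales
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h, w = hsv.shape[:2]
        for top in range(0, h - 10 + 1, 5):               # one strip per row offset
            strip = []
            for left in range(0, w - 10 + 1, 5):
                win = hsv[top:top + 10, left:left + 10].reshape(-1, 3)
                hist, _ = np.histogramdd(win, bins=(8, 8, 8),
                                         range=((0, 181), (0, 256), (0, 256)))
                strip.append(hist.ravel())                # 512 bins per sub-window
            feats.append(np.max(strip, axis=0))           # maximal occurrence over the strip
        img = cv2.resize(img, (w // 2, h // 2))           # approximates 2 x 2 average pooling
    # 512 x (24 + 11 + 5) = 20480 color dimensions; SILTP bins would be added to reach 26960
    return np.concatenate(feats)
```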
The PHOG is simply a combination of multiple layers of HOG: each layer is the HOG of the image at a different scale, i.e., the image is enlarged/reduced, the standard HOG feature is computed, and the HOG features at the different scales are concatenated. With the number of pyramid layers L = 3 and the number of gradient bins n = 8, the dimension of the PHOG feature is (1 + 4 + 16 + 64) × 8 = 680. PHOG is an important descriptor that performs prominently in recognition tasks.
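A compact sketch of a PHOG-style feature with L = 3 and n = 8, written with OpenCV gradients; this illustrates the pyramid-of-histograms idea rather than the exact descriptor used here:

```python
import cv2
import numpy as np

def phog_descriptor(gray_patch, levels=3, bins=8):
    """Orientation histograms over the whole patch and over 2x2, 4x4 and 8x8
    grids, concatenated: (1 + 4 + 16 + 64) * 8 = 680 dimensions for levels=3."""
    gx = cv2.Sobel(gray_patch, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray_patch, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)                  # gradient magnitude and angle
    h, w = gray_patch.shape
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level                              # cells per side at this level
        for r in range(cells):
            for c in range(cells):
                rows = slice(r * h // cells, (r + 1) * h // cells)
                cols = slice(c * w // cells, (c + 1) * w // cells)
                hist, _ = np.histogram(ang[rows, cols], bins=bins,
                                       range=(0, 2 * np.pi),
                                       weights=mag[rows, cols])
                feats.append(hist)
    return np.concatenate(feats)
```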
3) Fusion modeling and metric learning: after feature extraction is completed, 26960 + (680 + 512 + 128) × 20 = 53360-dimensional features are obtained. However, before connecting them together, they are fused based on metric learning. dist_{i,j} is defined as the distance between features x_i and x_j across different camera views:
dist_{i,j} = (x_i - x_j)^T W (x_i - x_j)   (12)
where w_i ≥ 0, W = diag(w) is a diagonal matrix with W_ii = w_i, and W can be determined by learning; d represents the feature dimension, equal to 53360 in this embodiment. Replacing W with a symmetric positive semi-definite matrix M gives the Mahalanobis distance:
dist_{i,j} = (x_i - x_j)^T M (x_i - x_j)   (13)
M denotes the metric matrix obtained by metric learning; note that M is symmetric positive semi-definite. M is embedded directly into the evaluation of a nearest-neighbor classifier, and M is obtained by optimizing the evaluation performance. The nearest-neighbor classifier uses majority voting when making a decision: each sample in the neighborhood casts 1 vote and samples outside it cast 0 votes. The probability that sample x_j contributes to the correct classification of x_i is
p_{i,j} = exp(-dist_{i,j}) / Σ_{k=1}^{l} exp(-dist_{i,k})   (14)
where l is the number of samples. As can be seen from equation (14), p_{i,j} is largest when i = j. Taking the recognition accuracy as the optimization objective, the leave-one-out (LOO) accuracy is calculated as
p_i = Σ_{j∈Ω_i} p_{i,j}   (15)
where Ω_i denotes the set of indices of samples belonging to the same class as x_i. The accuracy over the entire sample set is
f(M) = Σ_{i=1}^{l} p_i = Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (16)
Then, substituting equation (14) into equation (16) and setting M = PP^T gives the optimization objective of NCA:
P* = argmax_P Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (17)
By solving equation (17), the metric matrix M that maximizes the accuracy of the nearest-neighbor classifier is obtained. The distance between two pictures is measured with the metric matrix M: the smaller the distance, the more similar the two pictures. Finally, CMC curves for pedestrian re-identification are plotted from the obtained results, as shown in FIG. 7(a) and FIG. 7(b). Experiments with several different metric methods on different data sets show that the method achieves good results.
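A minimal sketch of evaluating the Mahalanobis distance of equation (13) with M = PP^T and the objective of equation (16); in practice P would be found by gradient ascent on this value, which is only evaluated here, and all names and shapes are illustrative:

```python
import numpy as np

def mahalanobis_dist(xi, xj, P):
    """dist_{i,j} = (x_i - x_j)^T M (x_i - x_j) with M = P P^T (equation (13))."""
    d = P.T @ (xi - xj)                 # (x)^T P P^T (x) equals ||P^T x||^2
    return float(d @ d)

def nca_objective(X, labels, P):
    """f(M) of equation (16): the sum over all samples of the soft-neighbor
    probabilities p_{i,j} (equation (14)) of same-identity pairs.
    X: (l, d) feature matrix, labels: (l,) identity labels, P: (d, k) projection."""
    labels = np.asarray(labels)
    Z = X @ P                                             # projected features
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise dist_{i,j}
    expd = np.exp(-d2)
    p = expd / expd.sum(axis=1, keepdims=True)            # p_{i,j}; each row sums to 1
    same = labels[:, None] == labels[None, :]             # same-class index sets Ω_i
    return float((p * same).sum())
```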
Experiment on VIPeR:
The proposed method reaches the state of the art at rank 1 with 56.83%, 9% above the next best method, LSSL. The method of the present invention was compared with other algorithms such as LADF, SalMatch, PRDC and ELF; the procedure was repeated 10 times to obtain average performance, and all experimental results show that the method of the present invention performs better than the others.
Experiment on PRID 2011:
The data set consists of images extracted from multiple person trajectories recorded by two different static surveillance cameras. The images from these cameras contain viewpoint variations and significant differences in lighting, background and camera characteristics. The PRID data set has 385 trajectories from camera A and 749 trajectories from camera B; of these, only 200 people appear in both cameras. Compared with several state-of-the-art algorithms, the algorithm of the present invention performs best on the PRID2011 dataset (78.3%, 92.6%, 97.5%). FIG. 7(b) shows that the method of the present invention performs better than the other methods, and by the CMC curves the results of the algorithm at rank 10, rank 15 and rank 20 are almost equally good.
In FIG. 7, "Ours" denotes the algorithm of the present invention, and the remaining curves are prior-art methods.
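For reference, a small sketch of how the CMC curves shown in FIG. 7 are typically computed from a probe-gallery distance matrix; it assumes every probe identity appears in the gallery, and the names are illustrative:

```python
import numpy as np

def cmc_curve(dist_matrix, probe_ids, gallery_ids, max_rank=20):
    """Rank-k accuracy: the fraction of probes whose correct gallery identity
    appears among the k nearest gallery entries under the learned metric."""
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(np.asarray(dist_matrix), axis=1)   # gallery sorted per probe
    hits = gallery_ids[order] == probe_ids[:, None]       # True where identities match
    first_hit = hits.argmax(axis=1)                       # rank (0-based) of first match
    return np.array([np.mean(first_hit <= k) for k in range(max_rank)])
```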
The above embodiments may be referred to each other, and the present embodiment does not limit the embodiments.
Finally, it should be noted that: the above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A pedestrian re-identification method based on body decomposition and significance detection is characterized by comprising the following steps:
S1, for a pedestrian picture to be processed from a first camera, dividing the picture into a plurality of picture blocks and decomposing the picture into semantic regions with a Deep Decomposition Network (DDN);
S2, processing the plurality of picture blocks by sliding-window and color matching based on the DDN semantic regions, and obtaining the set S1 of reserved picture blocks of the picture;
S3, automatically detecting salient regions in the picture using graph-based visual saliency (GBVS);
S4, matching each picture block in the set S1 against the salient regions to obtain a saliency value for each picture block in the set S1;
S5, based on the picture blocks in the set S1 and their saliency values, acquiring the corresponding image blocks of each pedestrian image to be processed in a second camera and the saliency value of each such image block;
the number of these image blocks is consistent with the number of picture blocks in the filtered set S1;
S6, extracting feature vectors representing all of these image blocks and feature vectors representing all picture blocks in the set S1;
and S7, fusing the feature vectors of the image blocks and the feature vectors of the picture blocks based on metric learning, and obtaining the recognition result of the current image in the second camera.
2. The method of claim 1, further comprising:
repeating the steps S5 to S7, and obtaining the identification results of all the images in the second camera.
3. The method according to claim 1, wherein the step S2 includes:
determining whether a picture block should be masked by computing the overlap rate between the sliding mask window and the colors of the pedestrian body parts in the semantic regions segmented by the DDN, wherein each image is divided into non-overlapping picture blocks of size 10 × 10;
[equation (1): the overlap rate c(P_ij) between the sliding mask M and picture block P_ij, expressed with the non-zero-element count u(·)]
wherein P_ij denotes the picture block in the i-th row and j-th column, i, j ∈ N+, {i, j | i ≤ m, j ≤ n}; m is the number of blocks in the horizontal direction after the image has been divided into picture blocks by the grid, and n is the number of blocks in the vertical direction; c(P_ij) denotes the overlap rate between the sliding mask M and P_ij; u(x) denotes the number of non-zero elements in the matrix x; and x_p and y_p denote the number of picture blocks in the horizontal and vertical directions, respectively;
wherein picture blocks with c(P_ij) < 25% are retained, and the retained picture blocks of each image are defined as the set S1.
4. The method according to claim 3, wherein the step S5 includes:
finding, from the image to be processed of the second camera B, the image block at the position corresponding to each picture block in the set S1, based on the position of each picture block in the set S1 of the first camera A;
the saliency similarity between a pair of picture blocks from different cameras is defined as:
[equation (2): the saliency similarity sim_saliency(P_{A,u}(i), P_{B,v}(j)), defined from the Euclidean distance d(·) between the saliency vectors of the two blocks with bandwidth parameter σ_d]
the salient picture blocks of the pedestrian image are denoted P_{A,u}(i), where (A, u) denotes a picture under the first camera A, i denotes the position of the decomposed picture block in the picture, s_{A,u}(i) is the saliency vector of the picture block, d(·) is the Euclidean distance, and σ_d is a bandwidth parameter; the corresponding picture block is obtained under the second camera B:
I_{B,u} = find(min(sim_saliency(P_{A,u}, P_{B,v})))   (3)
substituting equation (2) gives:
[equation (4): the result of substituting the similarity definition (2) into equation (3)]
the function find(·) returns the index of the image block of the image under the second camera B found by saliency matching with a picture block of the picture under the first camera A; the matched block is indexed by I_{B,u}, where u denotes the index value and i ∈ {1, 2, ..., 25}.
5. The method according to claim 4, wherein the step S6 includes:
S61, the LOMO feature analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes;
S62, a Retinex transform and a scale-invariant texture operator are applied to handle illumination changes;
S63, HSV color histograms are used to extract feature vectors with 8 × 8 × 8 = 512 dimensions;
S64, picture blocks are located in the 128 × 48 image using a sliding window of size 10 × 10 with an overlapping step of 5 pixels, and two scales of the SILTP histogram, SILTP^{0.3}_{4,3} and SILTP^{0.3}_{4,5}, are extracted, each with 3^4 bins; a three-scale pyramid representation is built by down-sampling the original 128 × 48 image with two 2 × 2 local average pooling operations and repeating the feature vector extraction, giving a final feature vector of (8 × 8 × 8 + 3^4 × 2) × (24 + 11 + 5 horizontal groups) = 26960 dimensions;
S65, a PHOG feature vector, an HSV histogram feature vector and a SIFT feature vector are extracted from each selected picture block;
with the number of pyramid layers set to L = 3 and the number of gradient bins set to n = 8, the dimension of the PHOG feature is (1 + 4 + 16 + 64) × 8 = 680, and the color histogram is an important descriptor that performs prominently in recognition tasks;
the RGB image is first converted into an HSV image to obtain the HSV histogram feature, whose dimension is 8 × 8 × 8 = 512;
and 128-dimensional SIFT features are extracted from the selected picture blocks.
6. The method according to claim 4 or 5, wherein the step S7 includes:
dist_{i,j} is defined as the distance between features x_i and x_j across different camera views;
dist_{i,j} = (x_i - x_j)^T W (x_i - x_j)   (5)
wherein w_i ≥ 0, W = diag(w) is a diagonal matrix with W_ii = w_i; W can be determined by learning; d represents the feature dimension of the feature vector; replacing W with a symmetric positive semi-definite matrix M yields the Mahalanobis distance;
dist_{i,j} = (x_i - x_j)^T M (x_i - x_j)   (6)
M represents the metric matrix obtained by metric learning; note that M is symmetric positive semi-definite; M is embedded directly into the evaluation of a nearest-neighbor classifier and is obtained by optimizing the evaluation performance; the nearest-neighbor classifier uses majority voting when making a decision: each picture block sample in the neighborhood casts 1 vote and samples outside it cast 0 votes; the probability that sample x_j contributes to the correct classification of x_i is
p_{i,j} = exp(-dist_{i,j}) / Σ_{k=1}^{l} exp(-dist_{i,k})   (7)
where l is the number of samples; from equation (7), p_{i,j} is maximal when i = j; the leave-one-out (LOO) accuracy is calculated as
p_i = Σ_{j∈Ω_i} p_{i,j}   (8)
wherein Ω_i denotes the set of indices of samples belonging to the same class as x_i; the accuracy over the entire sample set is
f(M) = Σ_{i=1}^{l} p_i = Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (9)
then, substituting equation (7) into equation (9) and setting M = PP^T gives the optimization objective of NCA
P* = argmax_P Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (10)
the metric matrix M that maximizes the accuracy of the nearest-neighbor classifier is obtained by solving equation (10); finally, the re-identification CMC curve is obtained.
CN201810288204.8A 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection Active CN108520226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810288204.8A CN108520226B (en) 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810288204.8A CN108520226B (en) 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection

Publications (2)

Publication Number Publication Date
CN108520226A CN108520226A (en) 2018-09-11
CN108520226B true CN108520226B (en) 2020-07-28

Family

ID=63431805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810288204.8A Active CN108520226B (en) 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection

Country Status (1)

Country Link
CN (1) CN108520226B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635636B (en) * 2018-10-30 2023-05-09 国家新闻出版广电总局广播科学研究院 Pedestrian re-identification method based on fusion of attribute characteristics and weighted blocking characteristics
CN109614894A (en) * 2018-11-28 2019-04-12 北京陌上花科技有限公司 Pedestrian recognition methods and device, storage medium, server again
CN111435430B (en) * 2019-01-15 2024-02-27 南京人工智能高等研究院有限公司 Object recognition method, object recognition device and electronic equipment
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN110245310B (en) * 2019-03-06 2023-10-13 腾讯科技(深圳)有限公司 Object behavior analysis method, device and storage medium
CN110335240B (en) * 2019-05-09 2021-07-27 河南萱闱堂医疗信息科技有限公司 Method for automatically grabbing characteristic pictures of tissues or foreign matters in alimentary canal in batches
CN110197154B (en) * 2019-05-30 2021-09-21 汇纳科技股份有限公司 Pedestrian re-identification method, system, medium and terminal integrating three-dimensional mapping of part textures
CN110427868A (en) * 2019-07-30 2019-11-08 上海工程技术大学 A kind of pedestrian identify again in feature extracting method
CN110659589B (en) * 2019-09-06 2022-02-08 中国科学院自动化研究所 Pedestrian re-identification method, system and device based on attitude and attention mechanism
CN110866532B (en) * 2019-11-07 2022-12-30 浙江大华技术股份有限公司 Object matching method and device, storage medium and electronic device
CN111046732B (en) * 2019-11-11 2023-11-28 华中师范大学 Pedestrian re-recognition method based on multi-granularity semantic analysis and storage medium
CN112200009B (en) * 2020-09-15 2023-10-17 青岛邃智信息科技有限公司 Pedestrian re-identification method based on key point feature alignment in community monitoring scene
CN112906679B (en) * 2021-05-08 2021-07-23 深圳市安软科技股份有限公司 Pedestrian re-identification method, system and related equipment based on human shape semantic segmentation
CN113408492B (en) * 2021-07-23 2022-06-14 四川大学 Pedestrian re-identification method based on global-local feature dynamic alignment
CN117877068B (en) * 2024-01-04 2024-09-20 哈尔滨理工大学 Mask self-supervision shielding pixel reconstruction-based shielding pedestrian re-identification method
CN118138792B (en) * 2024-05-07 2024-07-30 杭州育恩科技有限公司 Live broadcast method of multimedia teaching

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316031A (en) * 2017-07-04 2017-11-03 北京大学深圳研究生院 The image characteristic extracting method recognized again for pedestrian

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396412B2 (en) * 2012-06-21 2016-07-19 Siemens Aktiengesellschaft Machine-learnt person re-identification

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316031A (en) * 2017-07-04 2017-11-03 北京大学深圳研究生院 The image characteristic extracting method recognized again for pedestrian

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yixiu Liu et al., "Fast Tracking via Spatio-Temporal Context Learning based on", Proceedings of the 2017 IEEE International Conference on Information and Automation (ICIA), 2017-10-23, pp. 398-403 *
B. Schölkopf et al., "Graph-based visual saliency", Proceedings of the International Conference on Neural Information Processing Systems (2006), 2007-12-31, pp. 545-552 *
Ping Luo et al., "Pedestrian Parsing via Deep Decompositional Network", 2013 IEEE International Conference on Computer Vision, 2014-03-03, pp. 2648-2655 *

Also Published As

Publication number Publication date
CN108520226A (en) 2018-09-11

Similar Documents

Publication Publication Date Title
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
Fu et al. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition
CN106709449B (en) Pedestrian re-identification method and system based on deep learning and reinforcement learning
CN108052896B (en) Human body behavior identification method based on convolutional neural network and support vector machine
CN107506703B (en) Pedestrian re-identification method based on unsupervised local metric learning and reordering
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN103632132B (en) Face detection and recognition method based on skin color segmentation and template matching
JP6395481B2 (en) Image recognition apparatus, method, and program
CN109389074B (en) Facial feature point extraction-based expression recognition method
CN110033007B (en) Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN110414368A (en) A kind of unsupervised pedestrian recognition methods again of knowledge based distillation
Bedagkar-Gala et al. Multiple person re-identification using part based spatio-temporal color appearance model
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN110298297A (en) Flame identification method and device
Shahab et al. How salient is scene text?
CN111563452A (en) Multi-human body posture detection and state discrimination method based on example segmentation
CN104036284A (en) Adaboost algorithm based multi-scale pedestrian detection method
CN108734200B (en) Human target visual detection method and device based on BING (building information network) features
CN109271932A (en) Pedestrian based on color-match recognition methods again
Bhuiyan et al. Person re-identification by discriminatively selecting parts and features
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN113221770A (en) Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
CN114239754B (en) Pedestrian attribute identification method and system based on attribute feature learning decoupling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant