
CN108520226B - Pedestrian re-identification method based on body decomposition and significance detection - Google Patents


Info

Publication number
CN108520226B
Authority
CN
China
Prior art keywords
picture
image
blocks
feature
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810288204.8A
Other languages
Chinese (zh)
Other versions
CN108520226A (en)
Inventor
张云洲
刘一秀
王松
史维东
孙立波
刘双伟
李瑞龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201810288204.8A priority Critical patent/CN108520226B/en
Publication of CN108520226A publication Critical patent/CN108520226A/en
Application granted granted Critical
Publication of CN108520226B publication Critical patent/CN108520226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on body decomposition and saliency detection. A pedestrian image is first parsed into semantic regions with a Deep Decomposition Network (DDN), and the pedestrian is separated from the cluttered environment by sliding-window and color matching; the pedestrian image is then divided into small blocks, and effective picture regions are selected automatically according to the background subtraction result and the salient regions.

Description

Pedestrian re-identification method based on body decomposition and significance detection
Technical Field
The invention belongs to the field of image processing, and particularly relates to a pedestrian re-identification method based on body decomposition and significance detection.
Background
The purpose of pedestrian re-identification is to identify all images of a person captured by different cameras. It is an important aspect of the intelligent video research field and a rapidly developing, yet still maturing, topic in computer vision. Under the influence of different viewing angles, illumination and scales between shots, features such as the posture, color and outline of a pedestrian image differ greatly, and how to improve the re-identification rate of pedestrian images remains a great challenge.
Therefore, how to increase the re-recognition rate of the pedestrian image becomes a technical problem to be solved at present.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a pedestrian re-identification method based on body decomposition and significance detection, which can improve the re-identification rate of pedestrian images in different cameras.
In a first aspect, the present invention provides a pedestrian re-identification method based on body decomposition and saliency detection, including:
S1, for a pedestrian picture to be processed from a first camera, dividing the picture into a plurality of picture blocks and decomposing the picture into semantic regions with a Deep Decomposition Network (DDN); that is, the DDN renders each part of the human body in the picture in a different color, and a sliding window with color matching selects the picture blocks with a high overlap rate, so that the pedestrian can be separated from the environment;
S2, processing the plurality of picture blocks by sliding-window and color matching based on the DDN semantic regions, and obtaining the set S1 of reserved picture blocks of the picture;
S3, automatically detecting salient regions in the picture using graph-based visual saliency (GBVS); that is, to reduce the huge cost of matching time, this embodiment selects the 25 blocks with relatively high saliency scores, an empirical trade-off between time consumption and matching accuracy;
S4, matching each picture block in the set S1 against the salient regions to obtain a saliency value for each picture block in the set S1;
S5, based on the picture blocks in the set S1 and their saliency values, acquiring the corresponding image blocks of each pedestrian image to be processed in a second camera and the saliency value of each such image block;
the number of these image blocks is consistent with the number of picture blocks in the filtered set S1;
S6, extracting feature vectors representing all of these image blocks and feature vectors representing all picture blocks in the set S1;
and S7, fusing the feature vectors of the image blocks and the feature vectors of the picture blocks based on metric learning, and obtaining the recognition result of the current image in the second camera.
Optionally, the method further comprises:
repeating the steps S5 to S7, and obtaining the identification results of all the images in the second camera.
Optionally, the step S2 includes:
determining whether a picture block should be masked by computing the overlap rate between the sliding mask window and the colors of the pedestrian body parts in the semantic regions segmented by the DDN, wherein each image is divided into non-overlapping picture blocks of size 10 × 10;
[equation (1): the overlap rate c(P_ij) between the sliding mask M and picture block P_ij, expressed with the non-zero-element count u(·)]
wherein P_ij denotes the picture block in the i-th row and j-th column, i, j ∈ N+, {i, j | i ≤ m, j ≤ n}; m is the number of blocks in the horizontal direction after the image has been divided into picture blocks by the grid, and n is the number of blocks in the vertical direction; c(P_ij) denotes the overlap rate between the sliding mask M and P_ij; u(x) denotes the number of non-zero elements in the matrix x; and x_p and y_p denote the number of picture blocks in the horizontal and vertical directions, respectively;
picture blocks with c(P_ij) < 25% are retained and the other picture blocks are masked out; the background subtraction result is the basic condition for picture block selection, and in this embodiment the reserved picture blocks of each image are defined as the set S1.
Optionally, the step S5 includes:
finding, from the image to be processed of the second camera B, the image block at the position corresponding to each picture block in the set S1, based on the position of each picture block in the set S1 of the first camera A;
defining the saliency similarity between a pair of picture blocks from different cameras as
[equation (2): the saliency similarity sim_saliency(P_{A,u}(i), P_{B,v}(j)), defined from the Euclidean distance d(·) between the saliency vectors of the two blocks with bandwidth parameter σ_d]
the salient picture blocks of the pedestrian image are denoted P_{A,u}(i), where (A, u) denotes a picture under the first camera A, i denotes the position of the decomposed picture block in the picture, s_{A,u}(i) is the saliency vector of the picture block, d(·) is the Euclidean distance, and σ_d is a bandwidth parameter; the corresponding picture block is obtained under the second camera B:
I_{B,u} = find(min(sim_saliency(P_{A,u}, P_{B,v})))   (3)
substituting equation (2) gives:
[equation (4): the result of substituting the similarity definition (2) into equation (3)]
the function find(·) returns the index of the image block of the image under the second camera B found by saliency matching with a picture block of the picture under the first camera A; the matched block is indexed by I_{B,u}, where u denotes the index value and i ∈ {1, 2, ..., 25}.
Optionally, the step S6 includes:
S61, the LOMO feature analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes;
S62, a Retinex transform and a scale-invariant texture operator are applied to handle illumination changes;
S63, to make pedestrian re-identification easier than using the original image, the method of the invention uses HSV color histograms to extract feature vectors with 8 × 8 × 8 = 512 dimensions;
S64, picture blocks are located in the 128 × 48 image using a sliding window of size 10 × 10 with an overlapping step of 5 pixels; two scales of the SILTP histogram, SILTP^{0.3}_{4,3} and SILTP^{0.3}_{4,5}, are extracted, each with 3^4 bins; a three-scale pyramid representation is built by down-sampling the original 128 × 48 image with two 2 × 2 local average pooling operations and repeating the feature vector extraction, giving a final feature vector of (8 × 8 × 8 + 3^4 × 2) × (24 + 11 + 5 horizontal groups) = 26960 dimensions;
S65, a PHOG feature vector, an HSV histogram and SIFT feature vectors are extracted from each selected picture block;
with the number of pyramid layers set to L = 3 and the number of gradient bins set to n = 8, the dimension of the PHOG feature is (1 + 4 + 16 + 64) × 8 = 680; the color histogram is an important descriptor that performs prominently in recognition tasks; to obtain the HSV histogram feature, the RGB image is first converted into an HSV image, and the dimension of the HSV histogram feature is 8 × 8 × 8 = 512;
in addition, 128-dimensional SIFT features are extracted from the selected picture blocks.
Optionally, the step S7 includes:
dist_{i,j} is defined as the distance between features x_i and x_j across different camera views;
dist_{i,j} = (x_i - x_j)^T W (x_i - x_j)   (5)
wherein w_i ≥ 0, W = diag(w) is a diagonal matrix with W_ii = w_i; W can be determined by learning; d represents the feature dimension of the feature vector; replacing W with a symmetric positive semi-definite matrix M yields the Mahalanobis distance;
dist_{i,j} = (x_i - x_j)^T M (x_i - x_j)   (6)
M represents the metric matrix obtained by metric learning; note that M is symmetric positive semi-definite; M is embedded directly into the evaluation of a nearest-neighbor classifier and is obtained by optimizing the evaluation performance; the nearest-neighbor classifier uses majority voting when making a decision: each picture block sample in the neighborhood casts 1 vote and samples outside it cast 0 votes; the probability that sample x_j contributes to the correct classification of x_i is
p_{i,j} = exp(-dist_{i,j}) / Σ_{k=1}^{l} exp(-dist_{i,k})   (7)
where l is the number of samples; from equation (7), p_{i,j} is maximal when i = j; the leave-one-out (LOO) accuracy is calculated as
p_i = Σ_{j∈Ω_i} p_{i,j}   (8)
wherein Ω_i denotes the set of indices of samples belonging to the same class as x_i; the accuracy over the entire sample set is
f(M) = Σ_{i=1}^{l} p_i = Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (9)
then, substituting equation (7) into equation (9) and setting M = PP^T gives the optimization objective of NCA
P* = argmax_P Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (10)
by solving equation (10), the metric matrix M that maximizes the accuracy of the nearest-neighbor classifier is obtained; finally, the CMC curve for person re-identification is obtained.
The invention has the following beneficial effects:
the method can improve the re-recognition rate of the pedestrian images in different cameras.
That is, the method of the present invention parses the pedestrian image into semantic regions through a Deep Decomposition Network (DDN) and then separates the pedestrian from the environment by sliding-window and color matching.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a process according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a DDN method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating test results of DDN personnel re-identification according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a process for masking a background based on sliding window and color matching according to an embodiment of the present invention;
FIG. 5 is a schematic illustration of salient region selection in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of the significance detection of the GBVS algorithm in accordance with an embodiment of the present invention;
FIG. 7(a) and FIG. 7(b) are schematic views of the CMC curves of the method of the present invention and prior-art methods on the VIPeR and PRID2011 data sets, respectively.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
Prior-art feature representation methods focus mainly on two different aspects: hand-crafted features and deep features. More and more discriminative hand-crafted features have been developed to achieve exact matching. In recent years, deep learning has been widely used and has made great breakthroughs in almost all visual fields. In particular, for recognition tasks, many Convolutional Neural Network (CNN) based methods have been proposed to extract deep features. Relevant studies have shown that deep models trained on large-scale datasets (e.g., ImageNet) are very effective.
Background subtraction is an important pre-processing step for the image recognition task: it eliminates much of the interference caused by matching different people against the same background, i.e., much of the interference of a complex environment on pedestrian matching. Salient regions are receiving increasing attention as distinguishing features in a variety of recognition tasks, such as pedestrian search, multi-pedestrian tracking and behavior analysis across different camera scenes. Furthermore, more and more local descriptors are being developed for person re-identification, rather than focusing only on global features, because local detail information has proven very useful for person re-identification. Therefore, it is necessary to combine local descriptors with global visual features to form a new feature representation method for pedestrian re-identification based on background subtraction and saliency detection. In addition, many studies have shown that picture block matching can improve the accuracy of person identification. Since the local appearance blocks of a person seen from different angles have high similarity, many methods extract features from local picture blocks and match the feature vectors under specific constraints. However, a trade-off must be made between real-time performance and accuracy, especially when matching local features of picture blocks: if too many picture blocks are selected, accuracy becomes high but speed becomes very slow. For these reasons, the selection of picture blocks is highly important. That is, if all the picture blocks of each image are matched for pedestrian re-identification, too much time is consumed; however, when a subset of the picture blocks is selected at random, some of the discriminative picture blocks may be lost, reducing the accuracy of pedestrian re-identification.
In view of the above, it is important to select reliable picture blocks and perform efficient matching. A method for selecting reliable picture blocks by background subtraction and saliency detection is presented. It is worth mentioning that the background subtraction of the present invention is implemented by sliding-window and color matching after the pedestrian image is decomposed into semantic regions with a Deep Decomposition Network (DDN): the DDN renders each part of the human body in the whole picture in a different color, and the sliding window with color matching selects the picture blocks with a high overlap rate, so that the pedestrian can be separated from the environment. The most important innovation is the use of local descriptors to make up for the deficiency of global features in the pedestrian re-identification task.
In view of these defects, the method of the invention first parses the pedestrian image into semantic regions with a Deep Decomposition Network (DDN) and separates the pedestrian from the cluttered environment by a sliding-window and color matching method, i.e., the pedestrian image is parsed into semantic regions by the DDN.
The method comprises the following specific steps:
Step one: Background subtraction
1) DDN architecture: each image is divided into a plurality of picture blocks, which are then decomposed into the semantic regions of a Deep Decomposition Network (DDN). The DDN architecture is used for accurate pedestrian parsing, combining occlusion estimation and data transformation in a unified deep network. FIG. 2 shows the architecture of the DDN, which directly maps low-level visual features to a label map of body parts. The input is a feature vector x, and the output is a set of labels y_1, ..., y_n, each representing a body part. This architecture is mainly used for pedestrian parsing and comprises a down-sampling layer, two occlusion estimation layers, two fully-connected layers and two decomposition layers. Unlike a Convolutional Neural Network (CNN), each layer of the DDN is fully connected to the next layer, so the global structure of a person can be captured and the parsing result is improved; FIG. 3 shows the decomposition result of the DDN on a pedestrian picture. Using the semantic regions of the DDN, the background environment and the different parts of the human body are rendered in different colors, so that the pedestrian can be distinguished from the environment, in preparation for masking the background below. In this work, portrait images are first parsed into semantic regions (e.g., hair, head, body, arms, and legs) using a Deep Decomposition Network (DDN), and then pedestrians are separated from the cluttered environment using sliding windows and color matching. All of this work serves as a constraint on picture block selection.
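As an illustrative sketch (not part of the original disclosure), and assuming the DDN output is available as a per-pixel body-part label map, the background mask used by the sliding-window color matching below could be derived as follows; the label ids and names are placeholders that depend on the trained network:

```python
import numpy as np

def background_mask_from_labels(label_map, body_part_labels=(1, 2, 3, 4, 5)):
    """Turn a DDN per-pixel part label map into a binary background mask
    (1 = background pixel). body_part_labels lists the ids assumed to mark
    hair, head, body, arms and legs (illustrative values only)."""
    foreground = np.isin(label_map, body_part_labels)
    return (~foreground).astype(np.uint8)
```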
2) First, each image is divided into small picture blocks by an m × n grid, and then whether a picture block needs to be masked is judged by computing the overlap rate between the sliding mask window and the colors of the pedestrian body parts obtained from the semantic regions segmented by the DDN; the whole process is shown in FIG. 4. A 10 × 10 mask window slides over each picture, and the following formula measures whether a picture block is background to be masked:
[equation (1): the overlap rate c(P_ij) between the sliding mask M and picture block P_ij, expressed with the non-zero-element count u(·)]
where P_ij denotes the picture block in the i-th row and j-th column, i, j ∈ N+, {i, j | i ≤ m, j ≤ n}; c(P_ij) denotes the overlap rate between the sliding mask M and P_ij; m is the number of blocks in the horizontal direction after the image has been divided into picture blocks by the grid, and n is the number of blocks in the vertical direction; u(x) denotes the number of non-zero elements in the matrix x; and x_p and y_p denote the number of blocks in the horizontal and vertical directions, respectively, with x_p = m and y_p = n. Picture blocks with c(P_ij) < 25% are retained and the other picture blocks are masked out. The background subtraction result is the basic condition for block selection, and the reserved color blocks of each image are defined as the set S1 in this embodiment.
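The block retention rule around equation (1) can be sketched as follows; this is an illustrative implementation assuming the DDN-derived background mask is available as a binary array, with helper and parameter names that are not from the patent:

```python
import numpy as np

def select_foreground_blocks(background_mask, block_size=10, max_overlap=0.25):
    """Keep the 10 x 10 picture blocks whose overlap rate with the background
    mask is below 25% (the set S1). background_mask is an H x W binary array
    with 1 marking background pixels, e.g. derived from the DDN semantic map."""
    h, w = background_mask.shape
    n_rows, n_cols = h // block_size, w // block_size
    s1 = []
    for i in range(n_rows):
        for j in range(n_cols):
            block = background_mask[i * block_size:(i + 1) * block_size,
                                    j * block_size:(j + 1) * block_size]
            c = np.count_nonzero(block) / block.size   # overlap rate c(P_ij)
            if c < max_overlap:                        # c(P_ij) < 25% -> keep the block
                s1.append((i, j))
    return s1
```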
Step two: picture block selection
In pedestrian pictures taken by disjoint cameras, the saliency score of a picture block is very important, with some invariance. If tiles from two images of the same person match, then the saliency scores for these tiles should also be close to each other.
1) Saliency detection: based on the focus of attention, the salient region should have the following properties:
It makes the pedestrian more prominent than other distractors.
It is reliable for searching for the same pedestrian across different camera views.
It is easier to identify the same person than with abstract features, because if a salient object appears in one camera view, it is usually also salient in another camera view. For example, in FIG. 5, person p1 carries a red bag (shown in gray in the figure) on the shoulders, p2 has a yellow bag (shown in gray), p3 holds a red umbrella (shown in gray), and p4 holds a green bag (shown in gray) in the hand. A reliable way to obtain salient regions is a saliency learning algorithm: it divides the pedestrian into different parts, merges similar pixels, and then randomly selected body parts from the segmented semantic regions are shown to human annotators, who pick the most probable label from a label list according to what they see. However, this method takes a lot of labor and time. Thus, the present embodiment employs graph-based visual saliency (GBVS) to automatically detect salient regions. Furthermore, to reduce the huge cost of matching time, the present embodiment selects only the 25 blocks with relatively high saliency scores, which is an empirical trade-off between time consumption and matching accuracy. As can be seen from FIG. 6, the salient areas detected by the GBVS algorithm agree well with the attention distribution of human subjects. In many cases, different people from different camera views have different spatial distributions, while the salient regions of the same pedestrian under different camera views distinguish that pedestrian from others. For example, the prominent area in (a1) is a backpack; (a2) contains a similar salient region, so (a2) is the correct match for (a1). In (a3), a green bag hangs from the pedestrian's arm; the boy in (a4) wears a white and green jacket; the woman in (a5) holds a piece of white paper in her hand; they are all incorrect matches for (a1). For the same reason, (b2) is the correct match for (b1), while (b3), (b4) and (b5) are incorrect matches for (b1).
2) Selecting a picture block:
there are two principles for selecting a picture block. One is the result of background subtraction. The present embodiment obtains a set of picture blocks left after background subtraction for each image S1. The other is a significance test result. In addition, a second condition for picture block selection is set forth based on the computation of the saliency map.
First, the image is put into a Gaussian pyramid and multi-scale features are extracted during down-sampling:
R(σ) = I(x, y) ⊗ G(x, y, σ)   (2)
G(x, y, σ) = (1 / (2πσ²)) · exp(-(x² + y²) / (2σ²))   (3)
where R(σ) is the initial feature map of the GBVS model, I(x, y) represents the input image, G(x, y, σ) is the Gaussian kernel of the pyramid with scale factor (bandwidth) σ, and ⊗ represents the convolution operator.
Next, an activation map is formed from the feature map and, most importantly, a Markov matrix is constructed. The scale of the feature map is assumed to be constant; in other words, the scale σ is ignored. The dissimilarity between R(i, j) and R(p, q) is then defined as:
d((i, j) ‖ (p, q)) = | log( R(i, j) / R(p, q) ) |   (4)
where R(i, j) and R(p, q) denote the feature values of the pixels at (i, j) and (p, q), respectively. By connecting every pair of nodes of the lattice R, labeled with indices (i, j) and (p, q), a fully connected directed graph G_A is obtained. The directed edge from node (i, j) to node (p, q) is assigned the weight:
w((i, j), (p, q)) = d((i, j) ‖ (p, q)) · F(i - p, j - q)   (5)
F(a, b) = exp( -(a² + b²) / (2σ²) )   (6)
where σ is a free parameter. A Markov chain is defined on the obtained directed graph G_A. After the edge weights of G_A are normalized, the stationary distribution of the Markov chain gives the probability of transitioning from one state to another, from which the saliency over the graph is estimated and a saliency map A is obtained.
Finally, the saliency map A is normalized and a directed graph G_N is constructed, introducing an edge from (i, j) to (p, q) with weight:
w_N((i, j), (p, q)) = A(p, q) · F(i - p, j - q)   (7)
where A represents the final saliency map; each element of A represents the saliency value of the pixel at that position, and A has the same size as the original image. Each image is divided into non-overlapping picture blocks of size 10 × 10, and the blocks with higher saliency values are selected.
s(p_A(i, j)) = average(p_A(i, j))   (8)
where p_A(i, j) denotes the picture block in the i-th row and j-th column of A, and s(p_A(i, j)) denotes the average saliency value of p_A(i, j). Using 0.6 as the threshold for s(p_A(i, j)), picture blocks are filtered with the condition s(p_A(i, j)) > 0.6 and culled if they do not qualify; all retained picture blocks are defined as the set S2. The final set of retained blocks is defined by the results of both background subtraction and saliency detection, i.e., S1 ∩ S2.
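A minimal sketch of this two-condition block selection, assuming a GBVS saliency map normalized to [0, 1] has already been computed; the function, threshold and variable names mirror the text but are otherwise illustrative:

```python
import numpy as np

def select_salient_blocks(saliency_map, s1, block_size=10, thresh=0.6, top_k=25):
    """Average the saliency inside each retained 10 x 10 block (equation (8)),
    keep blocks with mean saliency > 0.6 (the set S2), intersect with the
    background subtraction result S1, and keep at most the top_k most salient."""
    scores = {}
    for (i, j) in s1:                                  # only blocks that survived S1
        block = saliency_map[i * block_size:(i + 1) * block_size,
                             j * block_size:(j + 1) * block_size]
        s = float(block.mean())                        # s(p_A(i, j))
        if s > thresh:                                 # condition defining S2
            scores[(i, j)] = s
    # S1 ∩ S2, ranked by saliency and truncated to top_k blocks
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```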
Step three: feature fusion
To overcome the disadvantages of each method while taking advantage of both, this section fuses global features with local descriptors to allow clear separation between different pedestrians.
1) Saliency matching: the saliency value s(p_B(i, j)) of each picture block in the set is calculated. The feature extraction processes are not parallel, because the spatial distributions of the selected blocks do not correspond exactly: even for the same pedestrian there is a spatial offset between the image pair. Feature extraction for the image under camera B must therefore be based on the prior salient spatial distribution of the image under camera A. To solve the problem of inconsistent spatial distribution, a saliency picture block matching method with distance tolerance is proposed; the embodiment of the invention adopts GBVS saliency detection, and the result of saliency detection on pedestrian pictures is shown in FIG. 6. To ensure that the local descriptor feature dimensions extracted from each image are the same, the 25 picture blocks with higher saliency scores are selected from the set S under camera A. Using the prior spatial distribution of saliency of these blocks, the 25 blocks corresponding to them are then found in each picture under camera B with a nearest-neighbor classifier on saliency.
Now, the saliency similarity between a pair of picture blocks from different images is defined as
[equation (9): the saliency similarity sim_saliency(P_{A,u}(i), P_{B,v}(j)), defined from the Euclidean distance d(·) between the saliency vectors of the two blocks with bandwidth parameter σ_d]
The salient picture blocks of the pedestrian image are denoted P_{A,u}(i), where (A, u) represents the view under camera A, i represents the position of the block in the image, and s_{A,u}(i) is the saliency vector of the picture block; in addition, d(·) is the Euclidean distance and σ_d is a bandwidth parameter. Finally, the corresponding picture block is obtained under camera B:
I_{B,u} = find(min(sim_saliency(P_{A,u}, P_{B,v})))   (10)
Substituting the saliency similarity definition (9) for a pair of picture blocks from different images gives:
[equation (11): the result of substituting the similarity definition (9) into equation (10)]
The function find(·) returns the index of the picture block of the image under camera B found by saliency matching with a picture block of the image under camera A; the matched block is indexed by I_{B,u}, where u denotes the index and i ∈ {1, 2, ..., 25}, as described above.
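The find(min(·)) correspondence of equation (10) amounts to a nearest-neighbor search on saliency vectors; a minimal sketch under the assumption that per-block saliency vectors have already been extracted (array and function names are illustrative):

```python
import numpy as np

def match_blocks_by_saliency(sal_vecs_a, sal_vecs_b):
    """For each of the 25 blocks selected under camera A, return the index of
    the camera-B block whose saliency vector is nearest in Euclidean distance.
    sal_vecs_a: (25, d) array; sal_vecs_b: (K, d) array of candidate blocks."""
    a = np.asarray(sal_vecs_a, dtype=float)
    b = np.asarray(sal_vecs_b, dtype=float)
    # pairwise Euclidean distances d(s_Au(i), s_Bv(j))
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return dists.argmin(axis=1)        # matched indices I_{B,u}, u = 1..25
```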
2) Feature extraction: after the designated picture block indices are obtained, features are extracted from these blocks. Feature extraction consists of two aspects: global feature extraction and local feature extraction.
The LOMO feature analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes. Furthermore, to handle illumination changes, a Retinex transform and a scale-invariant texture operator are applied. To make pedestrian re-identification easier than using the original image, the method of the invention applies HSV color histograms to extract features with 8 × 8 × 8 = 512 dimensions. In addition to the color description, the scale-invariant local ternary pattern (SILTP) descriptor is applied as an illumination-invariant texture description; SILTP is a well-known extension of the local binary pattern (LBP). Using a sub-window of size 10 × 10 with a 5-pixel overlapping step to locate local picture blocks in the 128 × 48 image, two scales of the SILTP histogram, SILTP^{0.3}_{4,3} and SILTP^{0.3}_{4,5}, are extracted, each with 3^4 bins. A three-scale pyramid representation is built by down-sampling the original 128 × 48 image with two 2 × 2 local average pooling operations and repeating the feature extraction, so the final feature has (8 × 8 × 8 color bins + 3^4 × 2 SILTP bins) × (24 + 11 + 5 horizontal groups) = 26960 dimensions.
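As an illustration of the LOMO idea of horizontal maximal occurrence, the following sketch computes only the HSV color part over the three pyramid scales (the SILTP part follows the same sliding-window / max-pooling pattern); it assumes a 128 × 48 BGR input, uses OpenCV and NumPy, and is not the reference implementation:

```python
import cv2
import numpy as np

def lomo_color_part(bgr_image):
    """HSV color part of a LOMO-style descriptor for a 128 x 48 image:
    10 x 10 sub-windows with a 5-pixel step form horizontal strips, an
    8 x 8 x 8 joint HSV histogram is taken per sub-window, and the bin-wise
    maximum over each strip gives a representation stable to viewpoint change."""
    feats, img = [], bgr_image
    for _ in range(3):                                    # three pyramid scales
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h, w = hsv.shape[:2]
        for top in range(0, h - 10 + 1, 5):               # one strip per row offset
            strip = []
            for left in range(0, w - 10 + 1, 5):
                win = hsv[top:top + 10, left:left + 10].reshape(-1, 3)
                hist, _ = np.histogramdd(win, bins=(8, 8, 8),
                                         range=((0, 181), (0, 256), (0, 256)))
                strip.append(hist.ravel())                # 512 bins per sub-window
            feats.append(np.max(strip, axis=0))           # maximal occurrence over the strip
        img = cv2.resize(img, (w // 2, h // 2))           # approximates 2 x 2 average pooling
    # 512 x (24 + 11 + 5) = 20480 color dimensions; SILTP bins would be added to reach 26960
    return np.concatenate(feats)
```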
The PHOG is simply a combination of multiple layers of HOG: each layer is the HOG of the image at a different scale, i.e., the image is enlarged/reduced, the standard HOG feature is computed, and the HOG features at the different scales are concatenated. With the number of pyramid layers L = 3 and the number of gradient bins n = 8, the dimension of the PHOG feature is (1 + 4 + 16 + 64) × 8 = 680. PHOG is an important descriptor that performs prominently in recognition tasks.
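A compact sketch of a PHOG-style feature with L = 3 and n = 8, written with OpenCV gradients; this illustrates the pyramid-of-histograms idea rather than the exact descriptor used here:

```python
import cv2
import numpy as np

def phog_descriptor(gray_patch, levels=3, bins=8):
    """Orientation histograms over the whole patch and over 2x2, 4x4 and 8x8
    grids, concatenated: (1 + 4 + 16 + 64) * 8 = 680 dimensions for levels=3."""
    gx = cv2.Sobel(gray_patch, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray_patch, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)                  # gradient magnitude and angle
    h, w = gray_patch.shape
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level                              # cells per side at this level
        for r in range(cells):
            for c in range(cells):
                rows = slice(r * h // cells, (r + 1) * h // cells)
                cols = slice(c * w // cells, (c + 1) * w // cells)
                hist, _ = np.histogram(ang[rows, cols], bins=bins,
                                       range=(0, 2 * np.pi),
                                       weights=mag[rows, cols])
                feats.append(hist)
    return np.concatenate(feats)
```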
3) Fusion modeling and metric learning: after feature extraction is completed, 26960 + (680 + 512 + 128) × 20 = 53360-dimensional features are obtained. However, before connecting them together, they are fused based on metric learning. dist_{i,j} is defined as the distance between features x_i and x_j across different camera views:
dist_{i,j} = (x_i - x_j)^T W (x_i - x_j)   (12)
where w_i ≥ 0, W = diag(w) is a diagonal matrix with W_ii = w_i, and W can be determined by learning; d represents the feature dimension, equal to 53360 in this embodiment. Replacing W with a symmetric positive semi-definite matrix M gives the Mahalanobis distance:
dist_{i,j} = (x_i - x_j)^T M (x_i - x_j)   (13)
M denotes the metric matrix obtained by metric learning; note that M is symmetric positive semi-definite. M is embedded directly into the evaluation of a nearest-neighbor classifier, and M is obtained by optimizing the evaluation performance. The nearest-neighbor classifier uses majority voting when making a decision: each sample in the neighborhood casts 1 vote and samples outside it cast 0 votes. The probability that sample x_j contributes to the correct classification of x_i is
p_{i,j} = exp(-dist_{i,j}) / Σ_{k=1}^{l} exp(-dist_{i,k})   (14)
where l is the number of samples. As can be seen from equation (14), p_{i,j} is largest when i = j. Taking the recognition accuracy as the optimization objective, the leave-one-out (LOO) accuracy is calculated as
p_i = Σ_{j∈Ω_i} p_{i,j}   (15)
where Ω_i denotes the set of indices of samples belonging to the same class as x_i. The accuracy over the entire sample set is
f(M) = Σ_{i=1}^{l} p_i = Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (16)
Then, substituting equation (14) into equation (16) and setting M = PP^T gives the optimization objective of NCA:
P* = argmax_P Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (17)
By solving equation (17), the metric matrix M that maximizes the accuracy of the nearest-neighbor classifier is obtained. The distance between two pictures is measured with the metric matrix M: the smaller the distance, the more similar the two pictures. Finally, CMC curves for pedestrian re-identification are plotted from the obtained results, as shown in FIG. 7(a) and FIG. 7(b). Experiments with several different metric methods on different data sets show that the method achieves good results.
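A minimal sketch of evaluating the Mahalanobis distance of equation (13) with M = PP^T and the objective of equation (16); in practice P would be found by gradient ascent on this value, which is only evaluated here, and all names and shapes are illustrative:

```python
import numpy as np

def mahalanobis_dist(xi, xj, P):
    """dist_{i,j} = (x_i - x_j)^T M (x_i - x_j) with M = P P^T (equation (13))."""
    d = P.T @ (xi - xj)                 # (x)^T P P^T (x) equals ||P^T x||^2
    return float(d @ d)

def nca_objective(X, labels, P):
    """f(M) of equation (16): the sum over all samples of the soft-neighbor
    probabilities p_{i,j} (equation (14)) of same-identity pairs.
    X: (l, d) feature matrix, labels: (l,) identity labels, P: (d, k) projection."""
    labels = np.asarray(labels)
    Z = X @ P                                             # projected features
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise dist_{i,j}
    expd = np.exp(-d2)
    p = expd / expd.sum(axis=1, keepdims=True)            # p_{i,j}; each row sums to 1
    same = labels[:, None] == labels[None, :]             # same-class index sets Ω_i
    return float((p * same).sum())
```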
Experiment on VIPeR:
The proposed method reaches the state of the art at rank 1 with 56.83%, 9% above the next best method, LSSL. The method of the present invention was compared with other algorithms such as LADF, SalMatch, PRDC and ELF; the procedure was repeated 10 times to obtain average performance, and all experimental results show that the method of the present invention performs better than the others.
Experiment on PRID 2011:
The data set consists of images extracted from multiple person trajectories recorded by two different static surveillance cameras. The images from these cameras contain viewpoint variations and significant differences in lighting, background and camera characteristics. The PRID data set has 385 trajectories from camera A and 749 trajectories from camera B; of these, only 200 people appear in both cameras. Compared with several state-of-the-art algorithms, the algorithm of the present invention performs best on the PRID2011 dataset (78.3%, 92.6%, 97.5%). FIG. 7(b) shows that the method of the present invention performs better than the other methods, and by the CMC curves the results of the algorithm at rank 10, rank 15 and rank 20 are almost equally good.
In FIG. 7, "Ours" denotes the algorithm of the present invention, and the remaining curves are prior-art methods.
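For reference, a small sketch of how the CMC curves shown in FIG. 7 are typically computed from a probe-gallery distance matrix; it assumes every probe identity appears in the gallery, and the names are illustrative:

```python
import numpy as np

def cmc_curve(dist_matrix, probe_ids, gallery_ids, max_rank=20):
    """Rank-k accuracy: the fraction of probes whose correct gallery identity
    appears among the k nearest gallery entries under the learned metric."""
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(np.asarray(dist_matrix), axis=1)   # gallery sorted per probe
    hits = gallery_ids[order] == probe_ids[:, None]       # True where identities match
    first_hit = hits.argmax(axis=1)                       # rank (0-based) of first match
    return np.array([np.mean(first_hit <= k) for k in range(max_rank)])
```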
The above embodiments may be referred to each other, and the present embodiment does not limit the embodiments.
Finally, it should be noted that: the above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A pedestrian re-identification method based on body decomposition and significance detection is characterized by comprising the following steps:
S1, for a pedestrian picture to be processed from a first camera, dividing the picture into a plurality of picture blocks and decomposing the picture into semantic regions with a Deep Decomposition Network (DDN);
S2, processing the plurality of picture blocks by sliding-window and color matching based on the DDN semantic regions, and obtaining the set S1 of reserved picture blocks of the picture;
S3, automatically detecting salient regions in the picture using graph-based visual saliency (GBVS);
S4, matching each picture block in the set S1 against the salient regions to obtain a saliency value for each picture block in the set S1;
S5, based on the picture blocks in the set S1 and their saliency values, acquiring the corresponding image blocks of each pedestrian image to be processed in a second camera and the saliency value of each such image block;
the number of these image blocks is consistent with the number of picture blocks in the filtered set S1;
S6, extracting feature vectors representing all of these image blocks and feature vectors representing all picture blocks in the set S1;
and S7, fusing the feature vectors of the image blocks and the feature vectors of the picture blocks based on metric learning, and obtaining the recognition result of the current image in the second camera.
2. The method of claim 1, further comprising:
repeating the steps S5 to S7, and obtaining the identification results of all the images in the second camera.
3. The method according to claim 1, wherein the step S2 includes:
determining whether a picture block should be masked by computing the overlap rate between the sliding mask window and the colors of the pedestrian body parts in the semantic regions segmented by the DDN, wherein each image is divided into non-overlapping picture blocks of size 10 × 10;
[equation (1): the overlap rate c(P_ij) between the sliding mask M and picture block P_ij, expressed with the non-zero-element count u(·)]
wherein P_ij denotes the picture block in the i-th row and j-th column, i, j ∈ N+, {i, j | i ≤ m, j ≤ n}; m is the number of blocks in the horizontal direction after the image has been divided into picture blocks by the grid, and n is the number of blocks in the vertical direction; c(P_ij) denotes the overlap rate between the sliding mask M and P_ij; u(x) denotes the number of non-zero elements in the matrix x; and x_p and y_p denote the number of picture blocks in the horizontal and vertical directions, respectively;
wherein picture blocks with c(P_ij) < 25% are retained, and the retained picture blocks of each image are defined as the set S1.
4. The method according to claim 3, wherein the step S5 includes:
finding, from the image to be processed of the second camera B, the image block at the position corresponding to each picture block in the set S1, based on the position of each picture block in the set S1 of the first camera A;
the saliency similarity between a pair of picture blocks from different cameras is defined as:
[equation (2): the saliency similarity sim_saliency(P_{A,u}(i), P_{B,v}(j)), defined from the Euclidean distance d(·) between the saliency vectors of the two blocks with bandwidth parameter σ_d]
the salient picture blocks of the pedestrian image are denoted P_{A,u}(i), where (A, u) denotes a picture under the first camera A, i denotes the position of the decomposed picture block in the picture, s_{A,u}(i) is the saliency vector of the picture block, d(·) is the Euclidean distance, and σ_d is a bandwidth parameter; the corresponding picture block is obtained under the second camera B:
I_{B,u} = find(min(sim_saliency(P_{A,u}, P_{B,v})))   (3)
substituting equation (2) gives:
[equation (4): the result of substituting the similarity definition (2) into equation (3)]
the function find(·) returns the index of the image block of the image under the second camera B found by saliency matching with a picture block of the picture under the first camera A; the matched block is indexed by I_{B,u}, where u denotes the index value and i ∈ {1, 2, ..., 25}.
5. The method according to claim 4, wherein the step S6 includes:
S61, the LOMO feature analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes;
S62, a Retinex transform and a scale-invariant texture operator are applied to handle illumination changes;
S63, HSV color histograms are used to extract feature vectors with 8 × 8 × 8 = 512 dimensions;
S64, picture blocks are located in the 128 × 48 image using a sliding window of size 10 × 10 with an overlapping step of 5 pixels, and two scales of the SILTP histogram, SILTP^{0.3}_{4,3} and SILTP^{0.3}_{4,5}, are extracted, each with 3^4 bins; a three-scale pyramid representation is built by down-sampling the original 128 × 48 image with two 2 × 2 local average pooling operations and repeating the feature vector extraction, giving a final feature vector of (8 × 8 × 8 + 3^4 × 2) × (24 + 11 + 5 horizontal groups) = 26960 dimensions;
S65, a PHOG feature vector, an HSV histogram feature vector and a SIFT feature vector are extracted from each selected picture block;
with the number of pyramid layers set to L = 3 and the number of gradient bins set to n = 8, the dimension of the PHOG feature is (1 + 4 + 16 + 64) × 8 = 680, and the color histogram is an important descriptor that performs prominently in recognition tasks;
the RGB image is first converted into an HSV image to obtain the HSV histogram feature, whose dimension is 8 × 8 × 8 = 512;
and 128-dimensional SIFT features are extracted from the selected picture blocks.
6. The method according to claim 4 or 5, wherein the step S7 includes:
dist_{i,j} is defined as the distance between features x_i and x_j across different camera views;
dist_{i,j} = (x_i - x_j)^T W (x_i - x_j)   (5)
wherein w_i ≥ 0, W = diag(w) is a diagonal matrix with W_ii = w_i; W can be determined by learning; d represents the feature dimension of the feature vector; replacing W with a symmetric positive semi-definite matrix M yields the Mahalanobis distance;
dist_{i,j} = (x_i - x_j)^T M (x_i - x_j)   (6)
M represents the metric matrix obtained by metric learning; note that M is symmetric positive semi-definite; M is embedded directly into the evaluation of a nearest-neighbor classifier and is obtained by optimizing the evaluation performance; the nearest-neighbor classifier uses majority voting when making a decision: each picture block sample in the neighborhood casts 1 vote and samples outside it cast 0 votes; the probability that sample x_j contributes to the correct classification of x_i is
p_{i,j} = exp(-dist_{i,j}) / Σ_{k=1}^{l} exp(-dist_{i,k})   (7)
where l is the number of samples; from equation (7), p_{i,j} is maximal when i = j; the leave-one-out (LOO) accuracy is calculated as
p_i = Σ_{j∈Ω_i} p_{i,j}   (8)
wherein Ω_i denotes the set of indices of samples belonging to the same class as x_i; the accuracy over the entire sample set is
f(M) = Σ_{i=1}^{l} p_i = Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (9)
then, substituting equation (7) into equation (9) and setting M = PP^T gives the optimization objective of NCA
P* = argmax_P Σ_{i=1}^{l} Σ_{j∈Ω_i} p_{i,j}   (10)
the metric matrix M that maximizes the accuracy of the nearest-neighbor classifier is obtained by solving equation (10); finally, the re-identification CMC curve is obtained.
CN201810288204.8A 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection Active CN108520226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810288204.8A CN108520226B (en) 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810288204.8A CN108520226B (en) 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection

Publications (2)

Publication Number Publication Date
CN108520226A CN108520226A (en) 2018-09-11
CN108520226B true CN108520226B (en) 2020-07-28

Family

ID=63431805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810288204.8A Active CN108520226B (en) 2018-04-03 2018-04-03 Pedestrian re-identification method based on body decomposition and significance detection

Country Status (1)

Country Link
CN (1) CN108520226B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635636B (en) * 2018-10-30 2023-05-09 国家新闻出版广电总局广播科学研究院 Pedestrian re-identification method based on fusion of attribute characteristics and weighted blocking characteristics
CN109614894A (en) * 2018-11-28 2019-04-12 北京陌上花科技有限公司 Pedestrian recognition methods and device, storage medium, server again
CN111435430B (en) * 2019-01-15 2024-02-27 南京人工智能高等研究院有限公司 Object recognition method, object recognition device and electronic equipment
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN110245310B (en) * 2019-03-06 2023-10-13 腾讯科技(深圳)有限公司 Object behavior analysis method, device and storage medium
CN110335240B (en) * 2019-05-09 2021-07-27 河南萱闱堂医疗信息科技有限公司 Method for automatically grabbing characteristic pictures of tissues or foreign matters in alimentary canal in batches
CN110197154B (en) * 2019-05-30 2021-09-21 汇纳科技股份有限公司 Pedestrian re-identification method, system, medium and terminal integrating three-dimensional mapping of part textures
CN110427868A (en) * 2019-07-30 2019-11-08 上海工程技术大学 A kind of pedestrian identify again in feature extracting method
CN110659589B (en) * 2019-09-06 2022-02-08 中国科学院自动化研究所 Pedestrian re-identification method, system and device based on attitude and attention mechanism
CN110866532B (en) * 2019-11-07 2022-12-30 浙江大华技术股份有限公司 Object matching method and device, storage medium and electronic device
CN111046732B (en) * 2019-11-11 2023-11-28 华中师范大学 Pedestrian re-recognition method based on multi-granularity semantic analysis and storage medium
CN112200009B (en) * 2020-09-15 2023-10-17 青岛邃智信息科技有限公司 Pedestrian re-identification method based on key point feature alignment in community monitoring scene
CN112906679B (en) * 2021-05-08 2021-07-23 深圳市安软科技股份有限公司 Pedestrian re-identification method, system and related equipment based on human shape semantic segmentation
CN113408492B (en) * 2021-07-23 2022-06-14 四川大学 Pedestrian re-identification method based on global-local feature dynamic alignment
CN117877068B (en) * 2024-01-04 2024-09-20 哈尔滨理工大学 Mask self-supervision shielding pixel reconstruction-based shielding pedestrian re-identification method
CN118138792B (en) * 2024-05-07 2024-07-30 杭州育恩科技有限公司 Live broadcast method of multimedia teaching

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316031A (en) * 2017-07-04 2017-11-03 北京大学深圳研究生院 The image characteristic extracting method recognized again for pedestrian

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396412B2 (en) * 2012-06-21 2016-07-19 Siemens Aktiengesellschaft Machine-learnt person re-identification

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316031A (en) * 2017-07-04 2017-11-03 北京大学深圳研究生院 The image characteristic extracting method recognized again for pedestrian

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yixiu Liu et al., "Fast Tracking via Spatio-Temporal Context Learning based on", Proceedings of the 2017 IEEE International Conference on Information and Automation (ICIA), 2017-10-23, pp. 398-403 *
B. Schölkopf et al., "Graph-based visual saliency", Proceedings of the International Conference on Neural Information Processing Systems (2006), 2007-12-31, pp. 545-552 *
Ping Luo et al., "Pedestrian Parsing via Deep Decompositional Network", 2013 IEEE International Conference on Computer Vision, 2014-03-03, pp. 2648-2655 *

Also Published As

Publication number Publication date
CN108520226A (en) 2018-09-11

Similar Documents

Publication Publication Date Title
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
Fu et al. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition
CN106709449B (en) Pedestrian re-identification method and system based on deep learning and reinforcement learning
CN108052896B (en) Human body behavior identification method based on convolutional neural network and support vector machine
CN107506703B (en) Pedestrian re-identification method based on unsupervised local metric learning and reordering
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN103632132B (en) Face detection and recognition method based on skin color segmentation and template matching
JP6395481B2 (en) Image recognition apparatus, method, and program
CN109389074B (en) Facial feature point extraction-based expression recognition method
CN110033007B (en) Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN110414368A (en) A kind of unsupervised pedestrian recognition methods again of knowledge based distillation
Bedagkar-Gala et al. Multiple person re-identification using part based spatio-temporal color appearance model
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN110298297A (en) Flame identification method and device
Shahab et al. How salient is scene text?
CN111563452A (en) Multi-human body posture detection and state discrimination method based on example segmentation
CN104036284A (en) Adaboost algorithm based multi-scale pedestrian detection method
CN108734200B (en) Human target visual detection method and device based on BING (building information network) features
CN109271932A (en) Pedestrian based on color-match recognition methods again
Bhuiyan et al. Person re-identification by discriminatively selecting parts and features
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN113221770A (en) Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
CN114239754B (en) Pedestrian attribute identification method and system based on attribute feature learning decoupling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant