
Harmonious Multi-branch Network for Person Re-identification with Harder Triplet Loss

Published: 04 March 2022

Abstract

Recently, advances in person re-identification (Re-ID) have benefited from the popular multi-branch network. However, performing feature learning in a single branch with uniform partitioning is likely to separate meaningful local regions, and correlation among different branches is not well established. In this article, we propose a novel harmonious multi-branch network (HMBN) to relieve these intra-branch and inter-branch problems harmoniously. HMBN is a multi-branch network with various stripes on different branches to learn coarse-to-fine pedestrian information. We first replace the uniform partition with a horizontal overlapped partition to cover meaningful local regions between adjacent stripes in a single branch. We then incorporate a novel attention module to make all branches interact by modeling spatial contextual dependencies across branches. Finally, in order to train the HMBN more effectively, a harder triplet loss is introduced to optimize triplets in a harder manner. Extensive experiments are conducted on three benchmark datasets (DukeMTMC-reID, CUHK03, and Market-1501), demonstrating the superiority of our proposed HMBN over state-of-the-art methods.

1 Introduction

Person re-identification (Re-ID) aims to retrieve a person of interest across non-overlapping camera views in a large image gallery given a probe image. Re-ID is a popular computer vision task because of its great potential in video surveillance applications. Recently, deep learning methods have pushed the performance of Re-ID to a new level. However, many challenges, such as pose variations, illumination variations, view angle variations, and occlusions, make Re-ID non-trivial.
To relieve these issues, many part-based methods [22, 24, 44, 50] with multiple branches have been proposed to learn local features and have achieved promising results. Specifically, these methods combine information at different granularities and learn coarse-to-fine representations in multi-branch networks. Although they achieve state-of-the-art performance, they still suffer from intra-branch and inter-branch problems, that is, the problems of feature learning in a single branch and correlation among different branches.
Feature learning in a single branch. In a single branch, some part-based methods conduct pre-defined horizontal or vertical partitions on feature maps to extract fine-grained information for local feature learning, based on the assumption that images are well aligned [5, 9, 41, 45, 57, 58]. Part-based Convolutional Baseline (PCB) [41] achieves competitive results compared with state-of-the-art methods by partitioning feature maps into 6 horizontal stripes. In PCB, as the number of stripes increases, retrieval accuracy improves at first but eventually drops dramatically. An over-increased number of stripes helps to learn fine-grained information but compromises the representational capability in meaningful local regions. We argue that the uniform partition is not optimal, as it separates important semantic regions, as shown in Figure 1.
Fig. 1. Illustration of intra-branch and inter-branch problems. For the problem of feature learning in a single branch, Branch N employs a uniform partition with 6 stripes. The head is divided into two stripes, which diminishes the representational capability in head regions. For the problem of correlation among different branches, the multi-branch network shares lower layers to learn strongly correlated features and performs independent feature learning in higher layers for different branches. However, strong relations between branches vanish after the split.
Correlation among different branches. As is shown in Figure 1, a multi-branch network shares lower layers and extracts distinct information at higher layers for different branches. The sharing learning scheme builds branch interaction in lower layers by extracting the same low-level features (e.g., edges, lines) for each branch. In this manner, the strongly correlated information in the low-level layer is exploited. However, the interaction among branches is neglected in higher layers of the network after the split.
Triplet loss [12] is a popular loss function in part-based methods with multiple branches because of its strong capability to optimize the similarity among samples. Triplet loss aims at reducing intra-class variation while enlarging inter-class variation. However, there is still room for improvement in how triplets are optimized.
Optimizing triplets. A triplet contains one anchor, one positive, and one negative. Given an anchor, mining the hard positive and hard negative is an essential part of learning with triplet loss. Schroff et al. [33] select all anchor-positive pairs, and pick hard negatives by semi-hard negative mining. Hermans et al. [12] propose to choose the hardest positives and hardest negatives within a mini-batch. However, triplets selected by the hardest positive and hardest negative mining are still not hard enough for models to discriminate without up-weighting anchor-to-positive distance or down-weighting anchor-to-negative distance. In this manner, intra-class and inter-class variations are difficult to further reduce and enlarge.
In this article, we propose a novel model, a harmonious multi-branch network (HMBN) with harder triplet loss (HTP), to tackle these problems. The HMBN jointly learns pedestrian representations in multi-granularity with three branches called S1B, S2B, and S3B. HMBN adopts S1B to learn global features and applies S2B and S3B to capture fine-grained information. In the single branch, instead of performing a uniform partition, we design a pooling strategy called horizontal overlapped pooling (HOP) to conduct a horizontal overlapped partition on feature maps and cover meaningful local regions between adjacent stripes. Furthermore, to learn interactive features among branches, we incorporate the inter-branch attention module (IBAM), which involves three inter-branch attention submodules (IBASMs). The IBAM enables our HMBN to refine features by aggregating spatial contextual information from different branches in higher layers. In this manner, interaction among branches is preserved in higher layers of the HMBN. In addition, a novel harder triplet loss (HTP) is introduced to optimize intra-class and inter-class similarities more effectively by optimizing triplets in a harder manner. HTP up-weights anchor-to-positive distance and down-weights anchor-to-negative distance by a polynomial mapping function and penalizes more in cases in which anchor-to-positive distance is not substantially smaller than the anchor-to-negative distance. In this process, HTP further reduces and enlarges intra-class and inter-class variations.
To sum up, our main contributions are as follows:
We propose a novel harmonious multi-branch network (HMBN) to learn discriminative pedestrian information by handling intra-branch and inter-branch problems harmoniously.
We design a new pooling strategy named horizontal overlapped pooling (HOP) that helps to keep the balance between learning fine-grained information and extracting features in meaningful local regions.
We incorporate a compound attention module called the inter-branch attention module (IBAM) into the HMBN to learn interactive representations for each branch. To the best of our knowledge, this is the first module that builds strong relations among different branches in higher layers for Re-ID.
We introduce a generalized triplet loss termed harder triplet loss (HTP) to optimize triplets in a harder manner, which is more effective than traditional triplet loss.
Extensive experiments on three datasets show that the HMBN outperforms state-of-the-art methods. In addition, ablation studies verify that HOP, IBAM, and HTP all contribute to an accuracy gain.
This article is an extended version of our early and preliminary conference work [43]. In this extended journal version, we make four modifications. (1) We introduce a multi-branch architecture for robust intra-branch and inter-branch feature learning explicitly. In our conference version, we mainly introduce two independent components (HOP and IBAM) but ignore the elaborate system in its entirety. The whole system (HMBN) is also a contribution of our work, as it alleviates intra-branch and inter-branch problems harmoniously. (2) Each component is discussed in more detail; for example, an additional comparison between the original uniform partition and our proposed horizontal overlapped partition is presented. (3) We propose an HTP to optimize triplets more effectively. (4) More comprehensive experiments with parameter analysis and visualizations are conducted. Specifically, we add ablation experiments to validate the HTP and an additional ablation study on the CUHK03-Labeled and CUHK03-Detected datasets. In addition, each component is verified more thoroughly compared with the conference version; for example, HOP is ablated in both single-branch and multi-branch networks.

2 Related Works

In this section, we discuss recent related works in terms of part-based Re-ID, attention-based Re-ID, and metric learning–based Re-ID.

2.1 Part-Based Re-ID

Convolutional neural networks (CNNs) were first applied to image classification tasks [11, 16, 17, 19, 37, 42]. Thus, it is popular to treat the training process of Re-ID as an image classification task and extract global pedestrian representations. However, global features alone are insufficient for capturing fine-grained cues, so many researchers aggregate global and local features. These methods can be divided into two categories according to the number of branches. The first category is single-branch methods [41, 45, 68]. For example, Varior et al. [45] crop images into several horizontal stripes and process the stripes sequentially with long short-term memory (LSTM) [13] cells. In this manner, contextual information among image regions is leveraged to enhance local feature representational capability. Zhou et al. [68] design a novel OSNet, which is composed of stacked convolutional streams with different receptive field sizes for extracting multi-scale features. The second category is multi-branch methods. Multi-branch methods are superior to single-branch methods in aggregating branch-specific information, that is, fine-grained cues [22, 50], human parsing results [18, 38], pose estimation [30, 39, 54], key point estimation [54], and semantic attributes [24, 44]. For example, in the SPReID framework, Kalayeh et al. [18] utilize a parsing model to generate probability maps associated with 5 predefined human parts and extract robust local features. Sarfraz et al. [32] incorporate 14 main body joint keypoints to model pedestrian information.
However, none of them perform feature map partition with overlap. In contrast, HOP is designed to cover meaningful body regions while extracting fine-grained information.

2.2 Attention-Based Re-ID

Attention mechanisms have verified their effectiveness in many tasks, for example, image classification [16, 48, 53, 60], video classification [25, 29], image captioning [6, 55], and generative adversarial networks (GANs) [27, 59]. They are also efficient and effective for Re-ID tasks, dynamically focusing on salient regions. Some methods employ spatially based attention [14, 23, 34, 38] for feature refinement. For example, Li et al. [23] introduce the HA-CNN to learn soft pixel attention and hard regional attention jointly for robust feature extraction. Channel-based attention is also explored in many works [47, 54]. The attention mechanism is also applicable to frame or feature sequences [3, 15, 20, 36]. Si et al. [36] propose a framework termed DuATM, which uses a dual attention mechanism to learn context-aware information by modeling intra-sequence and inter-sequence dependencies.
These works apply an attention mechanism to focus on certain patterns within a single branch. In contrast, our proposed IBAM helps to generate refined representations by aggregating information from all branches.

2.3 Metric Learning–Based Re-ID

Metric learning aims to learn a similarity or mapping function that minimizes the intra-class variation while maximizing the inter-class variation. Triplet loss is a widely used loss function that treats the Re-ID task as a ranking task and optimizes the similarity among anchor, positive, and negative samples. Various methods [12, 35, 40, 52] have been proposed to select hard triplets for discriminative learning. Batch hard triplet loss [12] selects the hardest positives and hardest negatives for robust Re-ID learning. In addition, many methods [61, 69, 70] have been proposed to improve gradient backpropagation. Zhou et al. [69] introduce the center point of the positive pair to model all of the pairwise relationships.
In contrast to previous variations of triplet loss, HTP is a generalized triplet loss to optimize triplets in a harder manner dynamically.

3 Harmonious Multi-branch Network (HMBN)

In this section, we first describe the overall architecture of the HMBN. Then, the coarse-to-fine structure and horizontal overlapped pooling (HOP) are discussed, followed by a novel attention module named the inter-branch attention module (IBAM). Next, an improved triplet loss called harder triplet loss (HTP) is presented. Finally, we discuss the relations between the proposed modules and some existing methods.

3.1 Overall Architecture

As shown in Figure 2, the HMBN is a multi-branch network comprising a base module and three independent branches. ResNet-50 [11] is adopted as our feature extraction backbone. The base module consists of the layers before conv4_2 and generates shared low-level visual features for each branch. Specifically, the three branches, named according to their number of stripes, are built from the layers after conv4_1: the stripe 1 branch (S1B), stripe 2 branch (S2B), and stripe 3 branch (S3B). S1B performs the Re-ID task at the global level, while S2B and S3B perform feature learning at both the global level and the part level. In S2B and S3B, we remove the last spatial down-sampling operation to enrich the granularity. As a result, feature tensors \(\boldsymbol {T}_1\), \(\boldsymbol {T}_2\), and \(\boldsymbol {T}_3\), the outputs of conv5 from S1B, S2B, and S3B, respectively, have different spatial sizes. In order to integrate multi-branch features, we inject the IBAM into the higher layers of the HMBN to exploit complementary information across branches.
Fig. 2. The overall architecture of the proposed HMBN. The HMBN contains a base module and three independent branches: S1B, S2B, and S3B. IBAM is injected in the higher layer of the network for modeling interactive information in different branches. HMBN learns global features by applying GMP on three branches and extracts local features by employing HOP on S2B and S3B. The whole network is trained with classification loss and HTP. GMP is short for global max pooling.
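A minimal PyTorch sketch of this branch construction is given below. It is our own approximation under the stated assumptions (torchvision's ResNet-50, the base/branch split around conv4_1 and conv4_2, and removal of the last down-sampling by setting the stride of the first conv5 block to 1); names such as MultiBranchBackbone and make_branch are ours, not the authors' released code.

```python
import copy

import torch
import torch.nn as nn
from torchvision.models import resnet50


def make_branch(net: nn.Module, keep_downsample: bool) -> nn.Sequential:
    """Copy conv4_2..conv5_3 from ResNet-50; optionally set the stride of the first
    conv5 block (and its shortcut) to 1 so the branch keeps a larger feature map."""
    conv4_rest = copy.deepcopy(nn.Sequential(*list(net.layer3.children())[1:]))
    conv5 = copy.deepcopy(net.layer4)
    if not keep_downsample:
        conv5[0].conv2.stride = (1, 1)
        conv5[0].downsample[0].stride = (1, 1)
    return nn.Sequential(conv4_rest, conv5)


class MultiBranchBackbone(nn.Module):
    """Shared base module up to conv4_1 plus three independent branches (S1B, S2B, S3B)."""

    def __init__(self):
        super().__init__()
        net = resnet50(pretrained=True)
        # Base module: conv1 .. conv4_1, shared by all branches.
        self.base = nn.Sequential(
            net.conv1, net.bn1, net.relu, net.maxpool,
            net.layer1, net.layer2, net.layer3[0],
        )
        self.s1b = make_branch(net, keep_downsample=True)   # global branch
        self.s2b = make_branch(net, keep_downsample=False)  # 2-stripe branch
        self.s3b = make_branch(net, keep_downsample=False)  # 3-stripe branch

    def forward(self, x):
        shared = self.base(x)
        return self.s1b(shared), self.s2b(shared), self.s3b(shared)


# For a 384x128 input, T1 is 2048x12x4 while T2 and T3 are 2048x24x8.
if __name__ == "__main__":
    t1, t2, t3 = MultiBranchBackbone()(torch.randn(2, 3, 384, 128))
    print(t1.shape, t2.shape, t3.shape)
```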
With global max pooling (GMP), the HMBN generates global feature representations \(\boldsymbol {g}_i (i=1,2,3)\) for each branch. A parameter shared 1x1 convolution layer, followed by a batch normalization layer and ReLU layer, is applied to reduce the dimension from 2048-dim \(\boldsymbol {g}_i(i=1,2,3)\) to 256-dim global feature \(\boldsymbol {u}_i(i=1,2,3)\).
With our proposed HOP, HMBN partitions \(\boldsymbol {T}_i(i=2,3)\) into 2 and 3 horizontal stripes in S2B and S3B, and pools these stripes to generate column feature vectors, that is, \(\boldsymbol {p}^n_m\), where \(m\), \(n\) refer to the \(m\)-th stripe in the stripe \(n\) branch. The dimension of \(\boldsymbol {p}^n_m\) is also reduced to 256 by the 1x1 convolution layer to acquire a dimension-reduced local feature \(\boldsymbol {v}^n_m\).
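The pooling-and-reduction step can be sketched as follows. This is an illustrative snippet rather than the authors' implementation; the ReductionHead name is ours, and applying one instance to every branch plays the role of the parameter-shared 1x1 convolution described above.

```python
import torch
import torch.nn as nn


class ReductionHead(nn.Module):
    """Global max pooling followed by a 1x1 conv + BN + ReLU mapping 2048-dim to 256-dim."""

    def __init__(self, in_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:   # (B, 2048, H, W)
        pooled = torch.amax(feat_map, dim=(2, 3), keepdim=True)  # GMP -> (B, 2048, 1, 1)
        return self.reduce(pooled).flatten(1)                    # (B, 256)
```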

3.2 Coarse-to-Fine Structure

The HMBN is a multi-branch network for coarse-to-fine feature learning. S1B is designed to learn coarse-grained (i.e., global) information, and S2B and S3B are incorporated to learn fine-grained information at different granularities. We compare the activations of the last convolutional feature maps from B + S1B, B + S2B, and B + S3B in Figure 3. B + S1B is short for a model including the base module and S1B, and so forth. B + S1B mainly focuses on the most discriminative regions (e.g., shoulder, shoes). As the number of stripes increases, more detailed regions can be observed. Regions marked by a red ellipse are ignored by B + S1B but are observed by B + S2B and B + S3B. Regions marked by a yellow ellipse are noticed only by B + S3B.
Fig. 3. Visualization results of activations in three coarse-to-fine networks.

3.3 Horizontal Overlapped Pooling (HOP)

Given a feature map \(\boldsymbol {F}\in \mathbb {R}^{C\times {H}\times {W}}\), HOP is illustrated in Figure 4. It has two parameters: \(l\) and \(k\). \(l\) is the total height of the overlapped areas in one stripe and \(k\) is the number of stripes. When \(k = 1\), HOP degrades into GMP and retains global information. When \(k \gt 1\), it learns fine-grained information. Thus, in the HMBN, we keep \(k\) = 2 in S2B and \(k\) = 3 in S3B.
Fig. 4. Horizontal overlapped pooling (HOP) in a general form. GMP is short for global max pooling. The size of the overlapped portion is \(C\times {h}\times {W}\). \(l\) is the total height of overlapped areas in one stripe. \(k\) is the number of partitions.
First, we perform a uniform horizontal partition on the feature map \(\boldsymbol {F}\). With the aim of devoting equal attention to each stripe, the stripes on the top and bottom are extended in one direction, and the others are extended in two directions, so that all stripes keep the same spatial size. An overlapped portion is a smaller 3D tensor of size \(C\times {h}\times {W}\), where \(h\) refers to its height. In this case, \(l = 2h\), so we require \(l\) to be an even number. Finally, each horizontal stripe is pooled by GMP to generate a part-level vector.
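The following snippet sketches this procedure in PyTorch. It is an illustrative implementation under the stated assumptions rather than the authors' code; the stripe boundary arithmetic is our reading of the extension rule above.

```python
import torch


def horizontal_overlapped_pooling(feat_map: torch.Tensor, k: int, l: int):
    """Split a (B, C, H, W) map into k horizontal stripes of equal height H//k + l.
    The top and bottom stripes extend in one direction only, inner stripes extend
    by l//2 on each side; every stripe is then reduced by global max pooling."""
    _, _, H, _ = feat_map.shape
    assert H % k == 0 and l % 2 == 0, "H must be divisible by k and l must be even"
    base = H // k
    parts = []
    for m in range(k):
        if m == 0:                                  # top stripe: extend downwards only
            top, bottom = 0, base + l
        elif m == k - 1:                            # bottom stripe: extend upwards only
            top, bottom = H - base - l, H
        else:                                       # inner stripes: extend both ways
            top, bottom = m * base - l // 2, (m + 1) * base + l // 2
        stripe = feat_map[:, :, top:bottom, :]
        parts.append(torch.amax(stripe, dim=(2, 3)))  # GMP -> (B, C) part-level vector
    return parts


# Example: with H = 24, k = 3, and l = 2, each stripe covers 10 rows instead of 8;
# k = 1 with l = 0 degrades to plain global max pooling.
```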
To highlight the difference between a uniform partition and horizontal overlapped partition, we visualize the region covered in each stripe in Figure 5 when the number of stripes is 6. The uniform partition diminishes the representational capability for partitioning the head into 2 stripes. With the horizontal overlapped partition, information for the head region is preserved well.
Fig. 5. Comparison between a uniform partition and horizontal overlapped partition when the number of stripes is 6. (a) Uniform partition. (b) Horizontal overlapped partition.

3.4 Inter-Branch Attention Module (IBAM)

Features extracted from different branches together help to boost the feature representational capability. In order to make branches interact, an IBAM is applied, as shown in Figure 2. Features from paired branches are fed into an inter-branch attention submodule (IBASM), which outputs paired refined features. The HMBN has three branches, which form \(C_3^2=3\) combinations when we choose paired branches. As a result, each branch is selected twice and has two refined outputs that build interaction between various branches. The mean of two refined outputs is used to update the original feature, which is represented by the mean operation in Figure 2.
Figure 6 depicts the detailed structure of IBASM. Given two feature maps \(\boldsymbol {A}\in \mathbb {R}^{C\times {H}\times {W}}\), \(\boldsymbol {B}\in \mathbb {R}^{C\times {H}\times {W}}\) from different branches, a 1x1 convolution layer is employed to generate four new feature maps \(\boldsymbol {X}\), \(\boldsymbol {Y}\), \(\boldsymbol {M}\), and \(\boldsymbol {N}\), where \({\boldsymbol {X}, \boldsymbol {Y},\boldsymbol {M}, \boldsymbol {N}}\in \mathbb {R}^{\frac{C}{8}\times {H}\times {W}}\). These four feature maps are reshaped to \(\mathbb {R}^{\frac{C}{8}\times {L}}\), where \(L=H\times {W}\) is the number of feature locations. Pixel-wise similarity in the spatial domain is calculated by matrix multiplication between transposed \(\boldsymbol {X}\) and \(\boldsymbol {N}\). It is then normalized to obtain the spatial attention map \(\boldsymbol {S}\in \mathbb {R}^{L\times {L}}\), as shown here:
\begin{equation} S_{i,j}=\frac{\exp {(m_{i,j})}}{\sum _{i=1}^L \exp {(m_{i,j})}}, \quad m_{i,j}=\boldsymbol {X}^T_i{\boldsymbol {N}}_j, \end{equation}
(1)
where \(\boldsymbol {X}_{i}\), \(\boldsymbol {N}_{j}\) denote the \(i^{th}\) and \(j^{th}\) spatial features of \(\boldsymbol {X}\) and \(\boldsymbol {N}\), respectively.
Fig. 6. The inter-branch attention submodule (IBASM). “\(\oplus\)” denotes element-wise sum; “\(\otimes\)” denotes matrix multiplication.
To calculate the output \(\boldsymbol {C}\), the HMBN first predicts \(\boldsymbol {A}\) with attention map \(\boldsymbol {S}\) and information from input \(\boldsymbol {B}\). The prediction, which is the result of matrix multiplication between transposed \(\boldsymbol {S}\) and \(\boldsymbol {M}\), is reshaped to \(\mathbb {R}^{C\times {H}\times {W}}\). Then, the HMBN performs an element-wise sum between the weighted prediction and the original \(\boldsymbol {A}\). The output \(\boldsymbol {C}\) is defined as
\begin{equation} {\boldsymbol {C}}_j=\gamma _1\sum _{i=1}^L{{S^T}_{i,j}\boldsymbol {M}_i}+ {\boldsymbol {A}}_j, \end{equation}
(2)
where \(\gamma _1\) is a learnable weight that is initialized as 0. The output \(\boldsymbol {D}\) is defined as
\begin{equation} {\boldsymbol {D}}_j=\gamma _2\sum _{i=1}^L{{S}_{i,j}\boldsymbol {Y}_i}+ {\boldsymbol {B}}_j. \end{equation}
(3)
In this manner, the refined \(\boldsymbol {A}\), which is denoted as \(\boldsymbol {C}\), contains reciprocal information from \(\boldsymbol {B}\). The refined \(\boldsymbol {B}\), which is denoted as \(\boldsymbol {D}\), contains reciprocal information from \(\boldsymbol {A}\).
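A compact PyTorch sketch of the IBASM is given below. It is our illustrative reading of Equations (1)-(3), not the authors' implementation; to keep the residual additions dimensionally consistent, the value projections \(\boldsymbol {M}\) and \(\boldsymbol {Y}\) keep \(C\) channels here, whereas the text reduces all four maps to \(C/8\) and reshapes the prediction back.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IBASM(nn.Module):
    """Sketch of the inter-branch attention submodule (Equations (1)-(3))."""

    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 8
        self.query_a = nn.Conv2d(channels, mid, kernel_size=1)       # X, from input A
        self.key_b = nn.Conv2d(channels, mid, kernel_size=1)         # N, from input B
        self.value_a = nn.Conv2d(channels, channels, kernel_size=1)  # Y, from input A
        self.value_b = nn.Conv2d(channels, channels, kernel_size=1)  # M, from input B
        self.gamma1 = nn.Parameter(torch.zeros(1))                   # initialized to 0
        self.gamma2 = nn.Parameter(torch.zeros(1))

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        bsz, c, h, w = a.shape
        x = self.query_a(a).flatten(2)                  # (B, C/8, L)
        n = self.key_b(b).flatten(2)                    # (B, C/8, L)
        y = self.value_a(a).flatten(2)                  # (B, C,   L)
        m = self.value_b(b).flatten(2)                  # (B, C,   L)
        # Eq. (1): spatial attention map S, normalized over the first location index.
        s = F.softmax(torch.bmm(x.transpose(1, 2), n), dim=1)        # (B, L, L)
        # Eq. (2): refine A with B's values aggregated by S^T, plus a residual connection.
        c_out = self.gamma1 * torch.bmm(m, s.transpose(1, 2)).view(bsz, c, h, w) + a
        # Eq. (3): refine B with A's values aggregated by S, plus a residual connection.
        d_out = self.gamma2 * torch.bmm(y, s).view(bsz, c, h, w) + b
        return c_out, d_out
```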
The IBAM can be plugged into two positions in the HMBN: the output of conv4 and the output of conv5. Since the inputs of the IBAM need to have the same size, injecting the IBAM at the output of conv5 requires a modification that removes the last spatial down-sampling operation in S1B as well. We find that adding the IBAM at the output of conv4 brings a larger performance improvement, because keeping the down-sampling operation in S1B produces features complementary to S2B and S3B, where the operation is removed. For this reason, the IBAM is placed at the output of conv4.
It is worth noting that the proposed IBAM is pluggable and can be injected into any existing multi-branch network because the IBASM does not change the size of the feature map.
Armed with our proposed IBAM, spatial contextual dependencies across branches are well established and the interactive information in multi-granularity is utilized in higher layers.
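Building on the IBASM sketch above, the whole module can be wired as follows. This is again a sketch; whether the three submodules share weights is not specified in the text, so separate instances are an assumption on our part.

```python
import torch.nn as nn


class IBAM(nn.Module):
    """Apply an IBASM (see the sketch above) to each of the three branch pairs and
    update every branch with the mean of its two refined outputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.sub12 = IBASM(channels)
        self.sub13 = IBASM(channels)
        self.sub23 = IBASM(channels)

    def forward(self, f1, f2, f3):
        r1a, r2a = self.sub12(f1, f2)
        r1b, r3a = self.sub13(f1, f3)
        r2b, r3b = self.sub23(f2, f3)
        return (r1a + r1b) / 2, (r2a + r2b) / 2, (r3a + r3b) / 2
```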

3.5 Harder Triplet Loss (HTP)

In this subsection, we revisit the traditional triplet loss and discuss its drawbacks in optimizing triplets. We then propose the harder triplet loss (HTP) to address these deficiencies.
Normally, triplet loss is trained on a set of triplet units \(\lbrace (\boldsymbol {x}, \boldsymbol {x}^+, \boldsymbol {x}^-)\rbrace\), in which \((\boldsymbol {x}, \boldsymbol {x}^+)\) represents a positive pair from the same pedestrian and a negative pair \((\boldsymbol {x}, \boldsymbol {x}^-)\) represents images from different pedestrians. Given one triplet \((\boldsymbol {x}, \boldsymbol {x}^+, \boldsymbol {x}^-)\), triplet loss is formulated as
\begin{equation} \begin{aligned}L_{tri}(f(\boldsymbol {x}), f(\boldsymbol {x}^+), f(\boldsymbol {x}^-)) &= \left[m+ d_{a,p} - d_{a,n}\right]_+, \\ d_{a,p} &= d\left(f(\boldsymbol {x}),f(\boldsymbol {x}^+) \right),\\ d_{a,n} &= d\left(f(\boldsymbol {x}),f(\boldsymbol {x}^-) \right), \end{aligned} \end{equation}
(4)
where \(m\) is the margin parameter, \(d_{a,p}\) and \(d_{a,n}\) are short for anchor-to-positive distance and anchor-to-negative distance, \(d(\cdot)\) is the Euclidean distance, \(\left[\cdot \right]_+\) denotes \(max(\cdot ,0)\), and \(f(\boldsymbol {x})\), \(f(\boldsymbol {x}^+)\), and \(f(\boldsymbol {x}^-)\) are features of sample \(\boldsymbol {x}\), \(\boldsymbol {x}^+\), and \(\boldsymbol {x}^-\), respectively.
The core idea of triplet loss is to optimize the similarity within triplets so that \(d_{a,n}\) is larger than \(d_{a,p}\) by at least the margin \(m\). However, the optimization can still be improved, as shown in Figure 7. To compare traditional triplet loss and HTP, we analyze the empirical distribution of the relative distance between the positive pair and the negative pair, defined as \([m+ d_{a,p}-d_{a,n} ]_+\), from a converged HMBN. The number of samples is displayed on a log axis because easy samples are extremely numerous. With our proposed HTP, the majority of samples move left in Figure 7(b) compared with Figure 7(a), indicating that HTP further reduces intra-class variation and enlarges inter-class variation.
Fig. 7. Empirical distribution of relative distance of a positive pair and negative pair from a converged HMBN model trained with traditional triplet loss and HTP on the DukeMTMC-reID dataset.
To optimize triplets in a harder manner, HTP penalizes large \(d_{a,p}\) and small \(d_{a,n}\) with a polynomial mapping function, defined as follows:
\begin{equation} \begin{aligned} \widetilde{d}_{a,p}=\left(d_{a,p}+1\right)^{(1+\alpha)}-1, \end{aligned} \end{equation}
(5)
\begin{equation} \begin{aligned} \widetilde{d}_{a,n}=\left(d_{a,n}+1\right)^{(1-\alpha)}-1, \end{aligned} \end{equation}
(6)
where \(\alpha\) is the scale factor. We update \(d_{a,p}\) and \(d_{a,n}\) with \(\widetilde{d}_{a,p}\) and \(\widetilde{d}_{a,n}\). The polynomial mapping function is visualized for several values of \(\alpha\) in Figure 8. When \(\alpha =0\), \(\widetilde{d}_{a,p}=d_{a,p}\), and \(\widetilde{d}_{a,n}=d_{a,n}\). The larger \(\alpha\) is, the more penalty \(d_{a,p}\) and \(d_{a,n}\) get.
Fig. 8. Illustration of polynomial mapping function. (a) Polynomial mapping function on \(d_{ap}\). (b) Polynomial mapping function on \(d_{an}\).
Based on the polynomial mapping function, the HTP is defined as follows:
\begin{equation} \begin{aligned}L_{HTP}(f(\boldsymbol {x}), f(\boldsymbol {x}^+), f(\boldsymbol {x}^-))=\left[m+ \widetilde{d}_{a,p} - \widetilde{d}_{a,n}\right]_+. \end{aligned} \end{equation}
(7)
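For concreteness, a minimal PyTorch sketch of HTP is shown below. It is our own rendering of Equations (5)-(7); the default margin of 1.2 and \(\alpha = 0.01\) follow the values reported in Section 4, and the triplet mining step itself is omitted.

```python
import torch
import torch.nn.functional as F


def harder_triplet_loss(anchor, positive, negative, margin=1.2, alpha=0.01):
    """Harder triplet loss sketch (Equations (5)-(7)); alpha = 0 recovers the
    traditional triplet loss. Inputs are (T, D) feature batches of mined triplets."""
    d_ap = F.pairwise_distance(anchor, positive)        # Euclidean d_{a,p}
    d_an = F.pairwise_distance(anchor, negative)        # Euclidean d_{a,n}
    d_ap = (d_ap + 1).pow(1 + alpha) - 1                # up-weight d_{a,p}, Eq. (5)
    d_an = (d_an + 1).pow(1 - alpha) - 1                # down-weight d_{a,n}, Eq. (6)
    return F.relu(margin + d_ap - d_an).mean()          # hinge of Eq. (7), averaged
```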
As shown in Figure 2, global features \(\boldsymbol {u}_i(i=1,2,3)\) are trained with HTP and classification loss. Specifically, HTP on global features can be formulated as
\begin{equation} \begin{aligned}L_{HTP}^{g} = \sum _{i=1}^{N_g}\left(\frac{1}{N_t}\sum _{j=1}^{N_t}L_{HTP}\left({\boldsymbol {u}_i}^{(j)},{\boldsymbol {u}_i}^{(j+)},{\boldsymbol {u}_i}^{(j-)}\right) \right), \end{aligned} \end{equation}
(8)
where \(N_g\) and \(N_t\) are the numbers of global features and sampled triplets, \({\boldsymbol {u}_i}^{(j)}\), \({\boldsymbol {u}_i}^{(j+)}\), \({\boldsymbol {u}_i}^{(j-)}\) are the feature \({\boldsymbol {u}_i}\) extracted from anchor, positive, and negative samples in the \(j\)-th triplet, respectively. Classification loss on global features can be formulated as
\begin{equation} L_{cls}^{g} = \sum _{i=1}^{N_g}\left(-\frac{1}{N}\sum _{j=1}^{N}\log \frac{\exp \left(((\boldsymbol {W}^i)_{y_j})^T\boldsymbol {u}_i\right)}{\sum _{k=1}^{C}\exp \left(((\boldsymbol {W}^i)_{k})^T\boldsymbol {u}_i\right)}\right), \end{equation}
(9)
where \(N\), \(C\) are the number of input images and identities, and \(y_j\) is the ground truth of the \(j\)-th input image. \((\boldsymbol {W}^i)_{k}\) is the \(k\)-th column of the fully connected layer whose input is \(\boldsymbol {u}_i\). Local features \(\boldsymbol {v}^n_m\) are trained only with classification loss. Classification loss on local features is formulated as
\begin{equation} L_{cls}^{l} = \sum _{n=2}^{N_b} \sum _{m=1}^{n}\left(-\frac{1}{N}\sum _{j=1}^{N}\log \frac{\exp \left(((\boldsymbol {W}_m^n)_{y_j})^T\boldsymbol {v}_m^n\right)}{\sum _{k=1}^{C}\exp \left(((\boldsymbol {W}_m^n)_{k})^T\boldsymbol {v}_m^n\right)}\right), \end{equation}
(10)
where \(N_b\) is the number of branches and \((\boldsymbol {W}_m^n)_{k}\) is \(k\)-th column of the fully connected layer whose input is \(\boldsymbol {v}^n_m\). The final loss is defined as follows:
\begin{equation} L=\frac{1}{N_{htp}}L_{HTP}^{g} + \lambda \frac{1}{N_{cls}}\left(L_{cls}^g + L_{cls}^l\right), \end{equation}
(11)
where \(N_{htp}\) and \(N_{cls}\) are the numbers of features trained with HTP and classification loss and \(\lambda\) is the weight of classification loss. Specifically, we set \(\lambda\) to 2 in the following experiments.
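The overall objective can be sketched as follows. This is an illustrative composition of Equations (8)-(11) that reuses the HTP sketch above; the triplet index arguments and tensor layout are assumptions on our part.

```python
import torch
import torch.nn.functional as F


def total_loss(global_feats, global_logits, local_logits, labels,
               anc_idx, pos_idx, neg_idx, lam=2.0):
    """Sketch of Equations (8)-(11): HTP on the global features plus weighted
    classification losses on global and local features. anc_idx/pos_idx/neg_idx
    index the triplets mined within the mini-batch (mining itself is omitted)."""
    # Eq. (8): HTP averaged over triplets, summed over the N_g global features.
    l_htp = sum(harder_triplet_loss(u[anc_idx], u[pos_idx], u[neg_idx])
                for u in global_feats)
    # Eq. (9) / Eq. (10): softmax cross-entropy on every global and local classifier.
    l_cls_g = sum(F.cross_entropy(z, labels) for z in global_logits)
    l_cls_l = sum(F.cross_entropy(z, labels) for z in local_logits)
    # Eq. (11): normalize by the number of features each term covers, weight by lambda = 2.
    n_htp, n_cls = len(global_feats), len(global_logits) + len(local_logits)
    return l_htp / n_htp + lam * (l_cls_g + l_cls_l) / n_cls
```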

3.6 Discussions

This subsection briefly discusses the proposed modules and some similar existing methods, emphasizing the differences between them. Note that our proposed modules and the compared existing methods are designed with different purposes, which means that they can hardly be compared in a fair experimental setting.
Relations between HOP and OBM. The overlapping blocks model (OBM) [4] proposes a multiple-overlapping-blocks structure to pool features from overlapping regions. The OBM requires horizontal partitions at multiple scales. In contrast, HOP operates at a single scale, which makes it lighter to train because it needs relatively fewer fully connected layers.
Relations between IBASM and non-local block. In some ways, the IBASM can be regarded as a variation of the non-local block [51]. The IBASM differs from the non-local block as follows: (1) The IBASM takes two input features while the non-local block takes one input feature. The IBASM performs non-local operations on two features. This modification helps the model refine one input feature with the consideration of the other input feature. (2) The IBASM produces two output features corresponding to two refined input features containing reciprocal information from each other. “Encoder-decoder attention” layers [46] and pair-wise non-local operation [10] both take two input features to compute non-local operations and produce one output feature corresponding to one refined input feature containing the reciprocal information from the other input feature. The IBASM is the first module to build relations between two branches for Re-ID.
Relations between IBAM and PS-MCNN. The IBAM has some similarities with the partially shared multi-task convolutional neural network (PS-MCNN) [1] because both are designed to make branches interact. However, our IBAM is different from a PS-MCNN in three aspects. (1) An IBAM aims to build relations among different branches with various granularities while a PS-MCNN focuses on building relations among different branches with various attribute groups. (2) An IBAM builds interactions among all branches by modeling the relations of paired branches while a PS-MCNN introduces a new Shared Network (SNet) to learn shared information for all branches. In addition, IBAM considers spatial information in the process of interaction, which is ignored by a PS-MCNN. (3) An IBAM is a module that can be easily embedded into any multi-branch network architecture, while a PS-MCNN is a network designed for building interactions among different branches with various attribute groups specifically. Our IBAM is more general than a PS-MCNN.

4 Experiments

In this section, we first describe three datasets and evaluation protocols in our experiments. Then, implementation details are introduced. Next, we compare the retrieval accuracy of the HMBN with state-of-the-art methods on these three datasets. Finally, we carry out ablation studies on DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected to verify the effectiveness of each component. Parameter analysis and visualization are also included.

4.1 Datasets and Evaluation Protocols

We conduct experiments on three popular Re-ID datasets: DukeMTMC-reID [31, 65], CUHK03 [21], and Market-1501 [63]. Dataset statistics are shown in Table 1. \({\bf DukeMTMC-reID}\) is a subset of the DukeMTMC dataset [31] for Re-ID and one of the largest datasets in the Re-ID task. It contains 36,411 images of 1,812 pedestrians captured by 8 high-resolution cameras. A total of 702 pedestrians with 16,522 images are randomly divided into the training set. Another 702 pedestrians form the testing set, in which 2,228 images and 17,661 images constitute the query set and gallery set, respectively. The remaining 408 pedestrians are distractors. \({\bf CUHK03}\) is a relatively small dataset compared with DukeMTMC-reID. It has 1,467 pedestrians with 14,097 images captured by 6 cameras on the CUHK campus. Both manually annotated and DPM-detected bounding boxes are provided, denoted as CUHK03-Labeled and CUHK03-Detected; we use both settings in this article. \({\bf Market-1501}\) is another large dataset, collected with the Deformable Part Model (DPM) detector [8] from 6 cameras. The whole dataset is separated into a training set with 12,936 images of 751 pedestrians and a testing set including 3,368 query images of 750 pedestrians and 15,913 gallery images of 751 pedestrians.
Table 1. Dataset Statistics

                      DukeMTMC-reID   CUHK03      Market-1501
Train (IDs/Images)    702/16,522      767/7,365   751/12,936
Gallery (IDs/Images)  1,110/17,661    700/5,332   751/15,913
Query (IDs/Images)    702/2,228       700/1,400   750/3,368
Cameras               8               6           6
We report Cumulative Matching Characteristics (CMCs) at rank-1, and the mean Average Precision (mAP) with the single-shot setting on all candidate datasets.
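For reference, these metrics can be computed from a query-gallery distance matrix roughly as follows. This is a simplified sketch; the standard removal of same-camera, same-identity gallery entries used in Re-ID evaluation is omitted here.

```python
import numpy as np


def rank1_and_map(dist, query_ids, gallery_ids):
    """Rank-1 CMC and mAP from a (num_query, num_gallery) distance matrix.
    Same-camera filtering of true matches is omitted for brevity."""
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])                       # gallery indices, nearest first
        matches = gallery_ids[order] == qid
        rank1_hits.append(float(matches[0]))
        hit_positions = np.where(matches)[0]
        if len(hit_positions) == 0:                       # no ground-truth match
            continue
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))
```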

4.2 Implementation Details

The proposed HMBN is implemented using the PyTorch [28] framework on a single NVIDIA GTX 1080 Ti GPU. The weights of ResNet-50 [11] pretrained on ImageNet [7] are adopted to initialize the parameters of the HMBN.
In the training phase, we resize the input images to \(384\times {128}\). They are then augmented by random horizontal flipping, normalization, and random erasing [67]. The total training phase takes 500 epochs. The initial learning rate is set to 2e-4 and then decays to 2e-5 and 2e-6 after 320 and 380 epochs, respectively. An Adam optimizer is used to update the weight parameters with a weight decay of 5e-4. The batch size is set to 32, in which each identity contains 4 images. The margin in the HTP is set to 1.2 in the following experiments. It should be emphasized that the current experimental setting differs slightly from the conference version in two ways. (1) Different batch sizes: the batch size is set to 32 in this journal version, while the batch size in the conference version is 16. (2) Different machines: experiments from the conference and journal versions are conducted on two independent machines with different hardware and software, that is, CPUs, GPUs, and operating systems. Because of these differences, the results reported in this article are slightly different from those of the conference version.
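These settings translate into roughly the following PyTorch recipe. This is a sketch of our reading of the text; the ImageNet normalization statistics are an assumption, as the exact values are not stated above, and the identity-balanced batch sampler is omitted.

```python
import torch
from torchvision import transforms

# Training augmentation pipeline matching the reported settings.
train_transform = transforms.Compose([
    transforms.Resize((384, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(),
])


def build_optimizer(model: torch.nn.Module):
    """Adam with weight decay 5e-4; the learning rate starts at 2e-4 and decays to
    2e-5 / 2e-6 after epochs 320 / 380 (500 epochs, batches of 8 identities x 4 images)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[320, 380], gamma=0.1)
    return optimizer, scheduler
```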
In the testing phase, the input images are resized to \(384\times {128}\) and augmented only by normalization. All dimension-reduced (256-dim) global and local features are concatenated as the final embedding vector of a pedestrian.
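At retrieval time this amounts to ranking gallery images by the distance between concatenated embeddings, roughly as sketched below; Euclidean distance is assumed here, matching the distance used in the triplet loss.

```python
import torch


def rank_gallery(query_feats: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """Rank gallery images for each query by Euclidean distance between the
    concatenated 256-dim global and local embeddings."""
    dist = torch.cdist(query_feats, gallery_feats)   # (num_query, num_gallery)
    return dist.argsort(dim=1)                       # ascending: best match first
```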

4.3 Comparison with State-of-the-Art Methods

More than 10 existing state-of-the-art methods are compared with our proposed HMBN on DukeMTMC-reID, CUHK03, and Market-1501 datasets in Table 2, Table 3, and Table 4, respectively. We separate these compared methods into three groups: single-branch methods (S), multi-branch methods (M), and attention-based methods regardless of the number of branches (A).
Table 2. Comparison of HMBN with State-of-the-Art Methods for DukeMTMC-reID Dataset

Group   Methods                          Rank-1   mAP
S       MLFN [2] (CVPR2018)              81.00    62.80
S       PCB+RPP [41] (ECCV2018)          83.30    69.20
S       HPM [9] (AAAI2019)               86.60    74.30
S       OSNet [68] (ICCV2019)            88.60    73.50
S       BoT [26] (CVPRW2019)             86.40    76.40
M       PSE [32] (CVPR2018)              79.80    62.00
M       HA-CNN [23] (CVPR2018)           80.50    63.80
M       C\(A^3\)Net [24] (ACM MM2018)    84.60    70.20
M       CAMA [56] (CVPR2019)             85.80    72.90
M       HOReID [49] (CVPR2020)           86.90    75.60
M       MGN [50] (ACM MM2018)            88.70    78.40
M       PISNet [62] (ECCV2020)           88.80    78.70
A       DuATM [36] (CVPR2018)            81.82    64.58
A       Mancs [47] (ECCV2018)            84.90    71.80
A       AANet-50 [44] (CVPR2019)         86.42    72.56
A       CASN [64] (CVPR2019)             87.70    73.70
        HMBN                             89.86    79.68
        HMBN (RK)                        92.19    90.44

The top results are in bold. “RK” means re-ranking. Groups: S = single-branch, M = multi-branch, A = attention-based.
Table 3. Comparison of HMBN with State-of-the-Art Methods for CUHK03-Labeled and CUHK03-Detected Datasets
Table 4. Comparison of HMBN with State-of-the-Art Methods for Market-1501 Dataset
\({\bf DukeMTMC-reID.}\) Our proposed HMBN achieves the best result of a Rank-1 accuracy of 89.86% and a mAP of 79.68% on the DukeMTMC-reID dataset. We should emphasize the following. (1) The gaps between HMBN and single-branch methods (MLFN [2], PCB+RPP [41], HPM [9], OSNet [68], and BoT [26]) demonstrate the effectiveness of the multi-branch structure, for example, the HMBN surpasses the BoT by Rank-1/mAP = 3.46%/3.28%. (2) Multi-branch methods (PSE [32], HA-CNN [23], C\(A^3\)Net [24], CAMA [56], HOReID [49], MGN [50], and PISNet [62]) integrate complementary information (e.g., pose estimation, human parsing results, attribute information) into final pedestrian representations, for example, HOReID [49] aligns local features with key-points estimation. Without injecting prior knowledge, such as attributes or poses, the HMBN exceeds the MGN and achieves the best results in this group, by 1.16% in Rank-1 accuracy and 1.28% in mAP. We argue that these methods ignore the interaction among branches in the multi-branch network. On the contrary, the HMBN builds interaction among branches in higher layers of the network by injecting the IBAM. (3) The HMBN outperforms the CASN [64], which achieves the top result in attention-based methods (DuATM [36], Mancs [47], AANet-50 [44], CASN [64]), by 2.16% in Rank-1 and 5.98% in mAP. Instead of modeling intra-branch contextual dependency in attention-based methods, our designed IBAM builds inter-branch dependency. With the help of re-ranking [66], we achieve a higher result of 92.19% in Rank-1 accuracy and 90.44% in mAP.
Figure 9 shows the top-10 ranking results with four query images on the DukeMTMC-reID dataset. Given a query image, the HMBN can retrieve the correct pedestrian under severe visual recognition problems such as view angle variations, illumination variations, and occlusion.
Fig. 9. Example results of our HMBN on the DukeMTMC-reID dataset. Given a query image, the top-10 ranking list is presented. Correct and incorrect matches are highlighted green and red, respectively.
\({\bf CUHK03.}\) We report a clear win on both CUHK03-Labeled and CUHK03-Detected in Table 3. The HMBN achieves the top results of 78.07% Rank-1 accuracy and 75.63% mAP on CUHK03-Labeled and 75.43% Rank-1 accuracy and 73.05% mAP on CUHK03-Detected. The HMBN outperforms the CASN [64], the best of the previous existing methods, by 4.37% in Rank-1 accuracy and 7.63% in mAP on CUHK03-Labeled, and by 3.93% in Rank-1 accuracy and 8.65% in mAP on CUHK03-Detected.
\({\bf Market-1501.}\) As illustrated in Table 4, our proposed HMBN achieves competitive results of 94.86% Rank-1 accuracy and 87.45% mAP on Market-1501. Although the MGN and PISNet outperform the HMBN by 0.84% and 0.74% in Rank-1, respectively, the HMBN clearly exceeds all existing methods in terms of mAP (87.45%).

4.4 Ablation Studies

To further verify the effectiveness of each component in the HMBN, we present ablation analysis on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets. Parameter analysis and visualization are performed on the DukeMTMC-reID and CUHK03-Labeled datasets.
\({\bf Multi-branch Structure.}\) The effectiveness of the uniform partition and the multi-branch structure is shown in Table 5 through the comparison of models 1-6. B + S1B is short for a model including the base module and S1B, B + S1B + S2B (\(l\) = 0) is a multi-branch model composed of the base module, S1B, and S2B, and so forth. For the uniform partition in a single branch, the number of horizontal stripes controls the granularity of the local features. As the number of stripes increases, the performance improves as well. However, as the number of stripes increases further, the improvement becomes marginal while the number of model parameters grows: for example, B + S2B (\(l\) = 0) outperforms B + S1B in mAP by 7.93%, but B + S3B (\(l\) = 0) outperforms B + S2B (\(l\) = 0) in mAP by only 1.98% on DukeMTMC-reID. For the multi-branch structure, a model with a multi-branch structure is better than each of its component branches: for example, B + S1B + S2B (\(l\) = 0) + S3B (\(l\) = 0) beats B + S1B, B + S2B (\(l\) = 0), B + S3B (\(l\) = 0), and B + S1B + S2B (\(l\) = 0) in Rank-1/mAP by 6.19%/13.74%, 2.92%/5.81%, 2.02%/3.83%, and 0.76%/2.62% on DukeMTMC-reID. As the number of branches increases further, the performance also increases, but only marginally: B + S1B + S3B (\(l\) = 0) outperforms B + S1B in mAP by 11.78%, but B + S1B + S2B (\(l\) = 0) + S3B (\(l\) = 0) outperforms B + S1B + S2B (\(l\) = 0) in mAP by only 2.62% on DukeMTMC-reID. To keep the balance between high retrieval accuracy and a low parameter count, we adopt B + S1B + S2B (\(l\) = 0) + S3B (\(l\) = 0) as the architecture used to verify the effectiveness of HOP, IBAM, and HTP.
Table 5. Ablation Studies of HMBN on DukeMTMC-reID, CUHK03-Labeled, CUHK03-Detected Datasets
\({\bf Effectiveness of HOP.}\) Figure 10 shows how Rank-1 accuracy and mAP change with the parameter \(l\) in HOP, where \(l\) is the total height of the overlapped areas in one stripe of one branch. To simplify the notation, \(l\) of HOP in S2B is denoted as \(l_2\) and \(l\) of HOP in S3B is denoted as \(l_3\). In the model with the single branch S2B (B + S2B), as illustrated in Figure 10(a), the performance rises as \(l_2\) increases, indicating that HOP is essential. However, the accuracy does not keep growing with \(l_2\). In the model with the single branch S3B (B + S3B), as illustrated in Figure 10(b), the accuracy shows the same trend as \(l_3\) increases. Note that an over-increased \(l\) helps to cover meaningful local regions between adjacent parts but diminishes the feature learning of fine-grained cues. A proper \(l\) achieves a good balance between learning fine-grained information and extracting features in meaningful local regions. As shown in Table 5, \((2, 0)\), \((2, 0)\), and \((2, 2)\) are recommended for \((l_2, l_3)\) on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively.
Fig. 10. Parameter analysis for \(l\) in S2B and S3B. (a) Rank-1 and mAP changes with \(l_2\) on DukeMTMC-reID. (b) Rank-1 and mAP changes with \(l_3\) on DukeMTMC-reID. (c) Rank-1 and mAP changes with \(l_2\) on CUHK03-Labeled. (d) Rank-1 and mAP changes with \(l_3\) on CUHK03-Labeled.
\({\bf Effectiveness of IBAM.}\) The comparison of models 7 and 8 in Table 5 shows the effectiveness of the IBAM. Empirically, the IBAM improves Rank-1/mAP by 0.45%/0.42%, 1.79%/1.26%, and 0.93%/1.73% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 7 vs. model 8). To verify whether the IBAM builds interaction among branches, we visualize the activations of the last convolutional feature maps from the three branches on DukeMTMC-reID and CUHK03-Labeled in Figure 11. We note that (1) S1B performs the Re-ID task at the global level, which is likely to ignore detailed information. The IBAM helps S2B and S3B interact with S1B, so S1B learns global information with the consideration of local information. Red ellipses mark the detailed information ignored by S1B in the HMBN without the IBAM but considered by S1B in the HMBN with the IBAM. (2) S2B and S3B perform the Re-ID task at the part level, which fails to learn consecutive local regions. With the injection of the IBAM, S2B and S3B successfully attend to larger local areas. Yellow ellipses mark the consecutive local regions ignored by S2B or S3B in the HMBN without the IBAM but considered by S2B or S3B in the HMBN with the IBAM.
Fig. 11. Visualization results of activations from the three branches in the HMBN. The top and bottom two input images come from DukeMTMC-reID and CUHK03-Labeled, respectively. For each input image, the activations in the first and second rows are from the HMBN without and with the IBAM, respectively.
\({\bf Effectiveness of Harder Triplet Loss.}\) The HMBN is trained with classification loss and HTP. Traditional triplet loss can be seen as a special case of HTP with \(\alpha =0\). As shown in Table 5, the HTP outperforms traditional triplet loss in Rank-1/mAP by 0.41%/0.04%, 1.78%/2.08%, and 3.14%/2.26% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 8 vs. model 9). Parameter analysis for \(\alpha\) in HTP is illustrated in Figure 12. Parameter \(\alpha\) controls the scale factor applied to the anchor-to-positive and anchor-to-negative distances. We can see that the performance of the HMBN is sensitive to \(\alpha\), and that the HMBN achieves its highest retrieval accuracy when \(\alpha =0.01\).
Fig. 12. Parameter analysis for \(\alpha\) in the HMBN on the DukeMTMC-reID and CUHK03-Labeled datasets.

5 Conclusions

In this article, we focus on the widely used multi-branch methods with different stripes and propose a harmonious multi-branch network for Re-ID with HTP. Unlike previous methods that design more branches or more stripes for extracting coarse-to-fine pyramid representations, we analyze how to improve feature learning in a single branch and build interaction among different branches. For feature learning in a single branch, we design the HOP to enhance representational capability in meaningful local regions while extracting fine-grained information. For the interaction among different branches, we incorporate the IBAM to refine the representation within a single branch by integrating information from other branches. In addition, we analyze the deficiencies of the commonly applied triplet loss and propose a generalized triplet loss, namely, HTP. Our HTP optimizes triplets in a harder manner, further reducing intra-class variation and enlarging inter-class variation. Each component is verified thoroughly in extensive ablation experiments. In addition, the HMBN achieves superior performance compared with state-of-the-art Re-ID methods. In the future, we will explore our idea of harmonious multi-branch learning in more computer vision tasks, such as image retrieval.

References

[1]
Jiajiong Cao, Yingming Li, and Zhongfei Zhang. 2018. Partially shared multi-task convolutional neural network with local constraint for face attribute learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 4290–4299.
[2]
Xiaobin Chang, Timothy M. Hospedales, and Tao Xiang. 2018. Multi-level factorisation net for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2109–2118.
[3]
Guangyi Chen, Jiwen Lu, Ming Yang, and Jie Zhou. 2019. Spatial-temporal attention-aware learning for video-based person re-identification. IEEE Transactions on Image Processing 28, 9 (2019), 4192–4205.
[4]
Yipeng Chen, Cairong Zhao, and Tianli Sun. 2019. Single image based metric learning via overlapping blocks model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Long Beach, USA, 0–0.
[5]
De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. 2016. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Las Vegas, USA, 1335–1344.
[6]
Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2019. Show, control and tell: A framework for generating controllable and grounded captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 8307–8316.
[7]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Miami, USA, 248–255.
[8]
Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. In 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Anchorage, USA, 1–8.
[9]
Yang Fu, Yunchao Wei, Yuqian Zhou, Honghui Shi, Gao Huang, Xinchao Wang, Zhiqiang Yao, and Thomas Huang. 2019. Horizontal pyramid matching for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. AAAI Press, Honolulu, USA, 8295–8302.
[10]
Zhihang Fu, Yaowu Chen, Hongwei Yong, Rongxin Jiang, Lei Zhang, and Xian-Sheng Hua. 2019. Foreground gating and background refining network for surveillance object detection. IEEE Transactions on Image Processing 28, 12 (2019), 6077–6090.
[11]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Las Vegas, USA, 770–778.
[12]
Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737
[13]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14]
Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. 2019. Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 9317–9326.
[15]
Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. 2019. VRSTC: Occlusion-free video person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 7183–7192.
[16]
Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7132–7141.
[17]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 4700–4708.
[18]
Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E. Kamasak, and Mubarak Shah. 2018. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 1062–1071.
[19]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 25. MIT Press, Stateline, USA, 1097–1105.
[20]
Shuang Li, Slawomir Bak, Peter Carr, and Xiaogang Wang. 2018. Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 369–378.
[21]
Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. 2014. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Columbus, USA, 152–159.
[22]
Wei Li, Xiatian Zhu, and Shaogang Gong. 2017. Person re-identification by deep joint learning of multi-loss classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, San Francisco, USA, 2194–2200.
[23]
Wei Li, Xiatian Zhu, and Shaogang Gong. 2018. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2285–2294.
[24]
Jiawei Liu, Zheng-Jun Zha, Hongtao Xie, Zhiwei Xiong, and Yongdong Zhang. 2018. CA3Net: Contextual-attentional attribute-appearance network for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, Seoul, South Korea, 737–745.
[25]
Xiang Long, Chuang Gan, Gerard De Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. 2018. Attention clusters: Purely attention based local feature integration for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7834–7843.
[26]
Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. 2019. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Long Beach, USA, 0–0.
[27]
Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. 2018. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 5657–5666.
[28]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703
[29]
Yuxin Peng, Yunzhen Zhao, and Junchao Zhang. 2018. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology 29, 3 (2018), 773–786.
[30]
Xuelin Qian, Yanwei Fu, Tao Xiang, Wenxuan Wang, Jie Qiu, Yang Wu, Yu-Gang Jiang, and Xiangyang Xue. 2018. Pose-normalized image generation for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, Munich, Germany, 650–667.
[31]
Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 17–35.
[32]
M. Saquib Sarfraz, Arne Schumann, Andreas Eberle, and Rainer Stiefelhagen. 2018. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 420–429.
[33]
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Boston, USA, 815–823.
[34]
Yantao Shen, Tong Xiao, Hongsheng Li, Shuai Yi, and Xiaogang Wang. 2018. End-to-end deep Kronecker-product matching for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 6886–6895.
[35]
Hailin Shi, Yang Yang, Xiangyu Zhu, Shengcai Liao, Zhen Lei, Weishi Zheng, and Stan Z. Li. 2016. Embedding deep metric for person re-identification: A study against large variations. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 732–748.
[36]
Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 5363–5372.
[37]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
[38]
Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. 2018. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 1179–1188.
[39]
Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. 2017. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Venice, Italy, 3960–3969.
[40]
Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. 2018. Part-aligned bilinear representations for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 402–419.
[41]
Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 480–496.
[42]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Boston, USA, 1–9.
[43]
Zengming Tang and Jun Huang. 2020. Branch interaction network for person re-identification. In Proceedings of the Asian Conference on Computer Vision (ACCV’20). Springer, Virtual Kyoto, Japan, 322–337.
[44]
Chiat-Pin Tay, Sharmili Roy, and Kim-Hui Yap. 2019. AANet: Attribute attention network for person re-identifications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 7134–7143.
[45]
Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang. 2016. A Siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 135–153.
[46]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Long Beach, USA, 5998–6008.
[47]
Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. 2018. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 365–381.
[48]
Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. 2017. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 3156–3164.
[49]
Guan’an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. 2020. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Virtual Seattle, USA, 6449–6458.
[50]
Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. 2018. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia. ACM, Seoul, South Korea, 274–282.
[51]
Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7794–7803.
[52]
Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q. Weinberger. 2018. Resource aware person re-identification across multiple resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 8042–8051.
[53]
Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 3–19.
[54]
Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. 2018. Attention-aware compositional network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2119–2128.
[55]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. PMLR, Lille, France, 2048–2057.
[56]
Wenjie Yang, Houjing Huang, Zhang Zhang, Xiaotang Chen, Kaiqi Huang, and Shu Zhang. 2019. Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 1389–1398.
[57]
Hantao Yao, Shiliang Zhang, Dongming Zhang, Yongdong Zhang, Jintao Li, Yu Wang, and Qi Tian. 2017. Large-scale person re-identification as retrieval. In IEEE International Conference on Multimedia and Expo (ICME’17). IEEE, Hong Kong, China, 1440–1445.
[58]
Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. 2014. Deep metric learning for person re-identification. In 22nd International Conference on Pattern Recognition. IEEE, Stockholm, Sweden, 34–39.
[59]
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019. Self-attention generative adversarial networks. In International Conference on Machine Learning. PMLR, Long Beach, USA, 7354–7363.
[60]
Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. 2020. ResNeSt: Split-attention networks. arXiv:2004.08955
[61]
Shun Zhang, Jia-Bin Huang, Jongwoo Lim, Yihong Gong, Jinjun Wang, Narendra Ahuja, and Ming-Hsuan Yang. 2020. Tracking persons-of-interest via unsupervised representation adaptation. International Journal of Computer Vision 128, 1 (2020), 96–120.
[62]
Shizhen Zhao, Changxin Gao, Jun Zhang, Hao Cheng, Chuchu Han, Xinyang Jiang, Xiaowei Guo, Wei-Shi Zheng, Nong Sang, and Xing Sun. 2020. Do not disturb me: Person re-identification under the interference of other pedestrians. In European Conference on Computer Vision. Springer, Virtual Glasgow, UK, 647–663.
[63]
Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Santiago, Chile, 1116–1124.
[64]
Meng Zheng, Srikrishna Karanam, Ziyan Wu, and Richard J. Radke. 2019. Re-identification with consistent attentive Siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 5735–5744.
[65]
Zhedong Zheng, Liang Zheng, and Yi Yang. 2017. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Venice, Italy, 3754–3762.
[66]
Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. 2017. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 1318–1327.
[67]
Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, New York, USA, 13001–13008.
[68]
Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. 2019. Omni-scale feature learning for person re-identification. arXiv:1905.00953
[69]
Sanping Zhou, Fei Wang, Zeyi Huang, and Jinjun Wang. 2019. Discriminative feature learning with consistent attention regularization for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, Seoul, South Korea, 8040–8049.
[70]
Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. 2017. Point to set similarity based deep feature learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 3741–3750.
