
Harmonious Multi-branch Network for Person Re-identification with Harder Triplet Loss

Published: 04 March 2022

Abstract

Recently, advances in person re-identification (Re-ID) have benefited from the popular multi-branch network. However, performing feature learning in a single branch with uniform partitioning is likely to separate meaningful local regions, and correlation among different branches is not well established. In this article, we propose a novel harmonious multi-branch network (HMBN) to relieve these intra-branch and inter-branch problems harmoniously. HMBN is a multi-branch network with various stripes on different branches to learn coarse-to-fine pedestrian information. We first replace the uniform partition with a horizontal overlapped partition to cover meaningful local regions between adjacent stripes in a single branch. We then incorporate a novel attention module to make all branches interact by modeling spatial contextual dependencies across branches. Finally, in order to train the HMBN more effectively, a harder triplet loss is introduced to optimize triplets in a harder manner. Extensive experiments are conducted on three benchmark datasets (DukeMTMC-reID, CUHK03, and Market-1501), demonstrating the superiority of our proposed HMBN over state-of-the-art methods.

1 Introduction

Person re-identification (Re-ID) aims to retrieve a person of interest across non-overlapping camera views in a large image gallery given a probe image. Re-ID is a popular computer vision task because of its great potential in video surveillance applications. Recently, deep learning methods have pushed the performance of Re-ID to a new level. However, many challenges, such as pose variations, illumination variations, view angle variations, and occlusions, make Re-ID non-trivial.
To relieve these issues, many part-based methods [22, 24, 44, 50] with multiple branches have been proposed to learn local features and have achieved promising results. Specifically, these methods combine information at different granularities and learn coarse-to-fine representations in multi-branch networks. Although they achieve state-of-the-art performance, they still suffer from intra-branch and inter-branch problems, that is, the problems of feature learning in a single branch and correlation among different branches.
Feature learning in a single branch. In a single branch, some part-based methods conduct pre-defined horizontal or vertical partitions on feature maps to extract fine-grained information for local feature learning, based on the assumption that images are well aligned [5, 9, 41, 45, 57, 58]. Part-based Convolutional Baseline (PCB) [41] achieves competitive results compared with state-of-the-art methods by partitioning feature maps into 6 horizontal stripes. In PCB, as the number of stripes increases, retrieval accuracy improves at first but eventually drops dramatically. An over-increased number of stripes helps to learn fine-grained information but compromises the representational capability in meaningful local regions. We argue that the uniform partition is not optimal, as it separates important semantic regions, as shown in Figure 1.
Fig. 1. Illustration of intra-branch and inter-branch problems. For the problem of feature learning in a single branch, Branch N employs a uniform partition with 6 stripes. The head is divided into two stripes, which diminishes the representational capability in head regions. For the problem of correlation among different branches, the multi-branch network shares lower layers to learn strongly correlated features and performs independent feature learning in higher layers for different branches. However, strong relations between branches vanish after the split.
Correlation among different branches. As is shown in Figure 1, a multi-branch network shares lower layers and extracts distinct information at higher layers for different branches. The sharing learning scheme builds branch interaction in lower layers by extracting the same low-level features (e.g., edges, lines) for each branch. In this manner, the strongly correlated information in the low-level layer is exploited. However, the interaction among branches is neglected in higher layers of the network after the split.
Triplet loss [12] is a popular loss function in part-based methods with multiple branches because of its strong capability to optimize the similarity among samples. Triplet loss aims at reducing intra-class variation while enlarging inter-class variation. However, there is still room for improvement in how triplets are optimized.
Optimizing triplets. A triplet contains one anchor, one positive, and one negative. Given an anchor, mining the hard positive and hard negative is an essential part of learning with triplet loss. Schroff et al. [33] select all anchor-positive pairs, and pick hard negatives by semi-hard negative mining. Hermans et al. [12] propose to choose the hardest positives and hardest negatives within a mini-batch. However, triplets selected by the hardest positive and hardest negative mining are still not hard enough for models to discriminate without up-weighting anchor-to-positive distance or down-weighting anchor-to-negative distance. In this manner, intra-class and inter-class variations are difficult to further reduce and enlarge.
In this article, we propose a novel model, a harmonious multi-branch network (HMBN) with harder triplet loss (HTP), to tackle these problems. The HMBN jointly learns pedestrian representations in multi-granularity with three branches called S1B, S2B, and S3B. HMBN adopts S1B to learn global features and applies S2B and S3B to capture fine-grained information. In the single branch, instead of performing a uniform partition, we design a pooling strategy called horizontal overlapped pooling (HOP) to conduct a horizontal overlapped partition on feature maps and cover meaningful local regions between adjacent stripes. Furthermore, to learn interactive features among branches, we incorporate the inter-branch attention module (IBAM), which involves three inter-branch attention submodules (IBASMs). The IBAM enables our HMBN to refine features by aggregating spatial contextual information from different branches in higher layers. In this manner, interaction among branches is preserved in higher layers of the HMBN. In addition, a novel harder triplet loss (HTP) is introduced to optimize intra-class and inter-class similarities more effectively by optimizing triplets in a harder manner. HTP up-weights anchor-to-positive distance and down-weights anchor-to-negative distance by a polynomial mapping function and penalizes more in cases in which anchor-to-positive distance is not substantially smaller than the anchor-to-negative distance. In this process, HTP further reduces and enlarges intra-class and inter-class variations.
To sum up, our main contributions are as follows:
We propose a novel harmonious multi-branch network (HMBN) to learn discriminative pedestrian information by handling intra-branch and inter-branch problems harmoniously.
We design a new pooling strategy named horizontal overlapped pooling (HOP) that helps to keep the balance between learning fine-grained information and extracting features in meaningful local regions.
We incorporate a compound attention module called the inter-branch attention module (IBAM) into the HMBN to learn interactive representations for each branch. To the best of our knowledge, this is the first module that builds strong relations among different branches in higher layers for Re-ID.
We introduce a generalized triplet loss termed harder triplet loss (HTP) to optimize triplets in a harder manner, which is more effective than traditional triplet loss.
Extensive experiments on three datasets show that the HMBN outperforms state-of-the-art methods. In addition, ablation studies verify that HOP, IBAM, and HTP all contribute to an accuracy gain.
This article is an extended version of our early and preliminary conference work [43]. In this extended journal version, we make four modifications. (1) We introduce a multi-branch architecture for robust intra-branch and inter-branch feature learning explicitly. In our conference version, we mainly introduce two independent components (HOP and IBAM) but ignore the elaborate system in its entirety. The whole system (HMBN) is also a contribution of our work, as it alleviates intra-branch and inter-branch problems harmoniously. (2) Each component is discussed in more detail; for example, an additional comparison between the original uniform partition and our proposed horizontal overlapped partition is presented. (3) We propose an HTP to optimize triplets more effectively. (4) More comprehensive experiments with parameter analysis and visualizations are conducted. Specifically, we add ablation experiments to validate the HTP and an additional ablation study on the CUHK03-Labeled and CUHK03-Detected datasets. In addition, each component is verified more thoroughly compared with the conference version; for example, HOP is ablated in both single-branch and multi-branch networks.

2 Related Works

In this section, we discuss recent related works in terms of part-based Re-ID, attention-based Re-ID, and metric learning–based Re-ID.

2.1 Part-Based Re-ID

Convolutional neural networks (CNNs) were first applied to image classification tasks [11, 16, 17, 19, 37, 42]. Thus, it is popular to treat the training process of Re-ID as an image classification task and extract global pedestrian representations. However, global features alone are insufficient for capturing fine-grained cues, so many researchers aggregate global and local features. These methods can be divided into two categories according to the number of branches. The first category is single-branch methods [41, 45, 68]. For example, Varior et al. [45] crop images into several horizontal stripes and process the stripes sequentially with long short-term memory (LSTM) [13] cells. In this manner, contextual information among image regions is leveraged to enhance local feature representational capability. Zhou et al. [68] design a novel OSNet, which is composed of stacked convolutional streams with different receptive field sizes for extracting multi-scale features. The second category is multi-branch methods. Multi-branch methods are superior to single-branch methods in aggregating branch-specific information, that is, fine-grained cues [22, 50], human parsing results [18, 38], pose estimation [30, 39, 54], key point estimation [54], and semantic attributes [24, 44]. For example, in the SPReID framework, Kalayeh et al. [18] utilize a parsing model to generate probability maps associated with 5 predefined human parts and extract robust local features. Sarfraz et al. [32] incorporate 14 main body joint keypoints to model pedestrian information.
However, none of them perform feature map partition with overlap. In contrast, HOP is designed to cover meaningful body regions while extracting fine-grained information.

2.2 Attention-Based Re-ID

Attention mechanisms have verified their effectiveness in many tasks, for example, image classification [16, 48, 53, 60], video classification [25, 29], image captioning [6, 55], and generative adversarial networks (GANs) [27, 59]. They are also efficient and effective for Re-ID tasks, dynamically focusing on salient regions. Some methods employ spatially based attention [14, 23, 34, 38] for feature refinement. For example, Li et al. [23] introduce the HA-CNN to learn soft pixel attention and hard regional attention jointly for robust feature extraction. Channel-based attention is also explored in many works [47, 54]. The attention mechanism is also applicable to frame or feature sequences [3, 15, 20, 36]. Si et al. [36] propose a framework termed DuATM, which uses a dual attention mechanism to learn context-aware information by modeling intra-sequence and inter-sequence dependencies.
These works apply an attention mechanism to focus on certain patterns within a single branch. In contrast, our proposed IBAM helps to generate refined representations by aggregating information from all branches.

2.3 Metric Learning–Based Re-ID

Metric learning aims to learn a similarity or mapping function that minimizes the intra-class variation while maximizing the inter-class variation. Triplet loss is a widely used loss function that treats the Re-ID task as a ranking task and optimizes the similarity among anchor, positive, and negative samples. Various methods [12, 35, 40, 52] have been proposed to select hard triplets for discriminative learning. Batch hard triplet loss [12] selects the hardest positives and hardest negatives for robust Re-ID learning. In addition, many methods [61, 69, 70] have been proposed to improve gradient backpropagation. Zhou et al. [69] introduce the center point of the positive pair to model all of the pairwise relationships.
In contrast to previous variations of triplet loss, HTP is a generalized triplet loss to optimize triplets in a harder manner dynamically.

3 Harmonious Multi-branch Network (HMBN)

In this section, we first describe the overall architecture of the HMBN. Then, the coarse-to-fine structure and horizontal overlapped pooling (HOP) are discussed, followed by a novel attention module named the inter-branch attention module (IBAM). Next, an improved triplet loss called harder triplet loss (HTP) is presented. Finally, we discuss the relations between the proposed modules and some existing methods.

3.1 Overall Architecture

As shown in Figure 2, the HMBN is a multi-branch network comprising a base module and three independent branches. ResNet-50 [11] is adopted as our feature extraction backbone. The base module consists of the layers before conv4_2 and generates shared low-level visual features for each branch. Specifically, the three branches, named according to their number of stripes, are built from the layers after conv4_1: the stripe 1 branch (S1B), stripe 2 branch (S2B), and stripe 3 branch (S3B). S1B performs the Re-ID task at the global level, while S2B and S3B perform feature learning at both the global level and the part level. In S2B and S3B, we remove the last spatial down-sampling operation to enrich the granularity. As a result, feature tensors \(\boldsymbol {T}_1\), \(\boldsymbol {T}_2\), and \(\boldsymbol {T}_3\), the outputs of conv5 from S1B, S2B, and S3B, respectively, have different spatial sizes. In order to integrate multi-branch features, we inject the IBAM into the higher layers of the HMBN to exploit complementary information across branches.
Fig. 2. The overall architecture of the proposed HMBN. The HMBN contains a base module and three independent branches: S1B, S2B, and S3B. IBAM is injected in the higher layer of the network for modeling interactive information in different branches. HMBN learns global features by applying GMP on three branches and extracts local features by employing HOP on S2B and S3B. The whole network is trained with classification loss and HTP. GMP is short for global max pooling.
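A minimal PyTorch sketch of this branch construction is given below. It is our own approximation under the stated assumptions (torchvision's ResNet-50, the base/branch split around conv4_1 and conv4_2, and removal of the last down-sampling by setting the stride of the first conv5 block to 1); names such as MultiBranchBackbone and make_branch are ours, not the authors' released code.

```python
import copy

import torch
import torch.nn as nn
from torchvision.models import resnet50


def make_branch(net: nn.Module, keep_downsample: bool) -> nn.Sequential:
    """Copy conv4_2..conv5_3 from ResNet-50; optionally set the stride of the first
    conv5 block (and its shortcut) to 1 so the branch keeps a larger feature map."""
    conv4_rest = copy.deepcopy(nn.Sequential(*list(net.layer3.children())[1:]))
    conv5 = copy.deepcopy(net.layer4)
    if not keep_downsample:
        conv5[0].conv2.stride = (1, 1)
        conv5[0].downsample[0].stride = (1, 1)
    return nn.Sequential(conv4_rest, conv5)


class MultiBranchBackbone(nn.Module):
    """Shared base module up to conv4_1 plus three independent branches (S1B, S2B, S3B)."""

    def __init__(self):
        super().__init__()
        net = resnet50(pretrained=True)
        # Base module: conv1 .. conv4_1, shared by all branches.
        self.base = nn.Sequential(
            net.conv1, net.bn1, net.relu, net.maxpool,
            net.layer1, net.layer2, net.layer3[0],
        )
        self.s1b = make_branch(net, keep_downsample=True)   # global branch
        self.s2b = make_branch(net, keep_downsample=False)  # 2-stripe branch
        self.s3b = make_branch(net, keep_downsample=False)  # 3-stripe branch

    def forward(self, x):
        shared = self.base(x)
        return self.s1b(shared), self.s2b(shared), self.s3b(shared)


# For a 384x128 input, T1 is 2048x12x4 while T2 and T3 are 2048x24x8.
if __name__ == "__main__":
    t1, t2, t3 = MultiBranchBackbone()(torch.randn(2, 3, 384, 128))
    print(t1.shape, t2.shape, t3.shape)
```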
With global max pooling (GMP), the HMBN generates global feature representations \(\boldsymbol {g}_i (i=1,2,3)\) for each branch. A parameter shared 1x1 convolution layer, followed by a batch normalization layer and ReLU layer, is applied to reduce the dimension from 2048-dim \(\boldsymbol {g}_i(i=1,2,3)\) to 256-dim global feature \(\boldsymbol {u}_i(i=1,2,3)\).
With our proposed HOP, HMBN partitions \(\boldsymbol {T}_i(i=2,3)\) into 2 and 3 horizontal stripes in S2B and S3B, and pools these stripes to generate column feature vectors, that is, \(\boldsymbol {p}^n_m\), where \(m\), \(n\) refer to the \(m\)-th stripe in the stripe \(n\) branch. The dimension of \(\boldsymbol {p}^n_m\) is also reduced to 256 by the 1x1 convolution layer to acquire a dimension-reduced local feature \(\boldsymbol {v}^n_m\).
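The pooling-and-reduction step can be sketched as follows. This is an illustrative snippet rather than the authors' implementation; the ReductionHead name is ours, and applying one instance to every branch plays the role of the parameter-shared 1x1 convolution described above.

```python
import torch
import torch.nn as nn


class ReductionHead(nn.Module):
    """Global max pooling followed by a 1x1 conv + BN + ReLU mapping 2048-dim to 256-dim."""

    def __init__(self, in_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:   # (B, 2048, H, W)
        pooled = torch.amax(feat_map, dim=(2, 3), keepdim=True)  # GMP -> (B, 2048, 1, 1)
        return self.reduce(pooled).flatten(1)                    # (B, 256)
```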

3.2 Coarse-to-Fine Structure

The HMBN is a multi-branch network for coarse-to-fine feature learning. S1B is designed to learn coarse-grained (i.e., global) information, and S2B and S3B are incorporated to learn fine-grained information at different granularities. We compare the activations of the last convolutional feature maps from B + S1B, B + S2B, and B + S3B in Figure 3. B + S1B is short for a model including the base module and S1B, and so forth. B + S1B mainly focuses on the most discriminative regions (e.g., shoulder, shoes). As the number of stripes increases, more detailed regions can be observed. Regions marked by a red ellipse are ignored by B + S1B but are observed by B + S2B and B + S3B. Regions marked by a yellow ellipse are noticed only by B + S3B.
Fig. 3. Visualization results of activations in three coarse-to-fine networks.

3.3 Horizontal Overlapped Pooling (HOP)

Given a feature map \(\boldsymbol {F}\in \mathbb {R}^{C\times {H}\times {W}}\), HOP is illustrated in Figure 4. It has two parameters: \(l\) and \(k\). \(l\) is the total height of the overlapped areas in one stripe and \(k\) is the number of stripes. When \(k = 1\), HOP degrades into GMP and retains global information. When \(k \gt 1\), it learns fine-grained information. Thus, in the HMBN, we keep \(k\) = 2 in S2B and \(k\) = 3 in S3B.
Fig. 4. Horizontal overlapped pooling (HOP) in a general form. GMP is short for global max pooling. The size of the overlapped portion is \(C\times {h}\times {W}\). \(l\) is the total height of overlapped areas in one stripe. \(k\) is the number of partitions.
First, we perform a uniform horizontal partition on the feature map \(\boldsymbol {F}\). With the aim of devoting equal attention to each stripe, the stripes on the top and bottom are extended in one direction, and the others are extended in two directions, so that all stripes keep the same spatial size. An overlapped portion is a smaller 3D tensor of size \(C\times {h}\times {W}\), where \(h\) refers to its height. In this case, \(l = 2h\), so we require \(l\) to be an even number. Finally, each horizontal stripe is pooled by GMP to generate a part-level vector.
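The following snippet sketches this procedure in PyTorch. It is an illustrative implementation under the stated assumptions rather than the authors' code; the stripe boundary arithmetic is our reading of the extension rule above.

```python
import torch


def horizontal_overlapped_pooling(feat_map: torch.Tensor, k: int, l: int):
    """Split a (B, C, H, W) map into k horizontal stripes of equal height H//k + l.
    The top and bottom stripes extend in one direction only, inner stripes extend
    by l//2 on each side; every stripe is then reduced by global max pooling."""
    _, _, H, _ = feat_map.shape
    assert H % k == 0 and l % 2 == 0, "H must be divisible by k and l must be even"
    base = H // k
    parts = []
    for m in range(k):
        if m == 0:                                  # top stripe: extend downwards only
            top, bottom = 0, base + l
        elif m == k - 1:                            # bottom stripe: extend upwards only
            top, bottom = H - base - l, H
        else:                                       # inner stripes: extend both ways
            top, bottom = m * base - l // 2, (m + 1) * base + l // 2
        stripe = feat_map[:, :, top:bottom, :]
        parts.append(torch.amax(stripe, dim=(2, 3)))  # GMP -> (B, C) part-level vector
    return parts


# Example: with H = 24, k = 3, and l = 2, each stripe covers 10 rows instead of 8;
# k = 1 with l = 0 degrades to plain global max pooling.
```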
To highlight the difference between a uniform partition and horizontal overlapped partition, we visualize the region covered in each stripe in Figure 5 when the number of stripes is 6. The uniform partition diminishes the representational capability for partitioning the head into 2 stripes. With the horizontal overlapped partition, information for the head region is preserved well.
Fig. 5. Comparison between a uniform partition and horizontal overlapped partition when the number of stripes is 6. (a) Uniform partition. (b) Horizontal overlapped partition.

3.4 Inter-Branch Attention Module (IBAM)

Features extracted from different branches together help to boost the feature representational capability. In order to make branches interact, an IBAM is applied, as shown in Figure 2. Features from paired branches are fed into an inter-branch attention submodule (IBASM), which outputs paired refined features. The HMBN has three branches, which form \(C_3^2=3\) combinations when we choose paired branches. As a result, each branch is selected twice and has two refined outputs that build interaction between various branches. The mean of two refined outputs is used to update the original feature, which is represented by the mean operation in Figure 2.
Figure 6 depicts the detailed structure of IBASM. Given two feature maps \(\boldsymbol {A}\in \mathbb {R}^{C\times {H}\times {W}}\), \(\boldsymbol {B}\in \mathbb {R}^{C\times {H}\times {W}}\) from different branches, a 1x1 convolution layer is employed to generate four new feature maps \(\boldsymbol {X}\), \(\boldsymbol {Y}\), \(\boldsymbol {M}\), and \(\boldsymbol {N}\), where \({\boldsymbol {X}, \boldsymbol {Y},\boldsymbol {M}, \boldsymbol {N}}\in \mathbb {R}^{\frac{C}{8}\times {H}\times {W}}\). These four feature maps are reshaped to \(\mathbb {R}^{\frac{C}{8}\times {L}}\), where \(L=H\times {W}\) is the number of feature locations. Pixel-wise similarity in the spatial domain is calculated by matrix multiplication between transposed \(\boldsymbol {X}\) and \(\boldsymbol {N}\). It is then normalized to obtain the spatial attention map \(\boldsymbol {S}\in \mathbb {R}^{L\times {L}}\), as shown here:
\begin{equation} S_{i,j}=\frac{\exp {(m_{i,j})}}{\sum _{i=1}^L \exp {(m_{i,j})}}, \quad m_{i,j}=\boldsymbol {X}^T_i{\boldsymbol {N}}_j, \end{equation}
(1)
where \(\boldsymbol {X}_{i}\), \(\boldsymbol {N}_{j}\) denote the \(i^{th}\) and \(j^{th}\) spatial features of \(\boldsymbol {X}\) and \(\boldsymbol {N}\), respectively.
Fig. 6. The inter-branch attention submodule (IBASM). “\(\oplus\)” denotes element-wise sum; “\(\otimes\)” denotes matrix multiplication.
To calculate the output \(\boldsymbol {C}\), the HMBN first predicts \(\boldsymbol {A}\) with attention map \(\boldsymbol {S}\) and information from input \(\boldsymbol {B}\). The prediction, which is the result of matrix multiplication between transposed \(\boldsymbol {S}\) and \(\boldsymbol {M}\), is reshaped to \(\mathbb {R}^{C\times {H}\times {W}}\). Then, the HMBN performs an element-wise sum between the weighted prediction and the original \(\boldsymbol {A}\). The output \(\boldsymbol {C}\) is defined as
\begin{equation} {\boldsymbol {C}}_j=\gamma _1\sum _{i=1}^L{{S^T}_{i,j}\boldsymbol {M}_i}+ {\boldsymbol {A}}_j, \end{equation}
(2)
where \(\gamma _1\) is a learnable weight that is initialized as 0. The output \(\boldsymbol {D}\) is defined as
\begin{equation} {\boldsymbol {D}}_j=\gamma _2\sum _{i=1}^L{{S}_{i,j}\boldsymbol {Y}_i}+ {\boldsymbol {B}}_j. \end{equation}
(3)
In this manner, the refined \(\boldsymbol {A}\), which is denoted as \(\boldsymbol {C}\), contains reciprocal information from \(\boldsymbol {B}\). The refined \(\boldsymbol {B}\), which is denoted as \(\boldsymbol {D}\), contains reciprocal information from \(\boldsymbol {A}\).
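A compact PyTorch sketch of the IBASM is given below. It is our illustrative reading of Equations (1)-(3), not the authors' implementation; to keep the residual additions dimensionally consistent, the value projections \(\boldsymbol {M}\) and \(\boldsymbol {Y}\) keep \(C\) channels here, whereas the text reduces all four maps to \(C/8\) and reshapes the prediction back.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IBASM(nn.Module):
    """Sketch of the inter-branch attention submodule (Equations (1)-(3))."""

    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 8
        self.query_a = nn.Conv2d(channels, mid, kernel_size=1)       # X, from input A
        self.key_b = nn.Conv2d(channels, mid, kernel_size=1)         # N, from input B
        self.value_a = nn.Conv2d(channels, channels, kernel_size=1)  # Y, from input A
        self.value_b = nn.Conv2d(channels, channels, kernel_size=1)  # M, from input B
        self.gamma1 = nn.Parameter(torch.zeros(1))                   # initialized to 0
        self.gamma2 = nn.Parameter(torch.zeros(1))

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        bsz, c, h, w = a.shape
        x = self.query_a(a).flatten(2)                  # (B, C/8, L)
        n = self.key_b(b).flatten(2)                    # (B, C/8, L)
        y = self.value_a(a).flatten(2)                  # (B, C,   L)
        m = self.value_b(b).flatten(2)                  # (B, C,   L)
        # Eq. (1): spatial attention map S, normalized over the first location index.
        s = F.softmax(torch.bmm(x.transpose(1, 2), n), dim=1)        # (B, L, L)
        # Eq. (2): refine A with B's values aggregated by S^T, plus a residual connection.
        c_out = self.gamma1 * torch.bmm(m, s.transpose(1, 2)).view(bsz, c, h, w) + a
        # Eq. (3): refine B with A's values aggregated by S, plus a residual connection.
        d_out = self.gamma2 * torch.bmm(y, s).view(bsz, c, h, w) + b
        return c_out, d_out
```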
The IBAM can be plugged into two positions in the HMBN: the output of conv4 and the output of conv5. Since the inputs of the IBAM need to have the same size, injecting the IBAM at the output of conv5 requires a modification that removes the last spatial down-sampling operation in S1B as well. We find that adding the IBAM at the output of conv4 brings a larger performance improvement, because keeping the down-sampling operation in S1B produces features complementary to S2B and S3B, where the operation is removed. For this reason, the IBAM is placed at the output of conv4.
It is worth noting that the proposed IBAM is pluggable and can be injected into any existing multi-branch network because the IBASM does not change the size of the feature map.
Armed with our proposed IBAM, spatial contextual dependencies across branches are well established and the interactive information in multi-granularity is utilized in higher layers.
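Building on the IBASM sketch above, the whole module can be wired as follows. This is again a sketch; whether the three submodules share weights is not specified in the text, so separate instances are an assumption on our part.

```python
import torch.nn as nn


class IBAM(nn.Module):
    """Apply an IBASM (see the sketch above) to each of the three branch pairs and
    update every branch with the mean of its two refined outputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.sub12 = IBASM(channels)
        self.sub13 = IBASM(channels)
        self.sub23 = IBASM(channels)

    def forward(self, f1, f2, f3):
        r1a, r2a = self.sub12(f1, f2)
        r1b, r3a = self.sub13(f1, f3)
        r2b, r3b = self.sub23(f2, f3)
        return (r1a + r1b) / 2, (r2a + r2b) / 2, (r3a + r3b) / 2
```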

3.5 Harder Triplet Loss (HTP)

In this subsection, we revisit the traditional triplet loss and discuss its drawbacks in optimizing triplets. We then propose the harder triplet loss (HTP) to address these deficiencies.
Normally, triplet loss is trained on a set of triplet units \(\lbrace (\boldsymbol {x}, \boldsymbol {x}^+, \boldsymbol {x}^-)\rbrace\), in which \((\boldsymbol {x}, \boldsymbol {x}^+)\) represents a positive pair from the same pedestrian and a negative pair \((\boldsymbol {x}, \boldsymbol {x}^-)\) represents images from different pedestrians. Given one triplet \((\boldsymbol {x}, \boldsymbol {x}^+, \boldsymbol {x}^-)\), triplet loss is formulated as
\begin{equation} \begin{aligned}L_{tri}(f(\boldsymbol {x}), f(\boldsymbol {x}^+), f(\boldsymbol {x}^-)) &= \left[m+ d_{a,p} - d_{a,n}\right]_+, \\ d_{a,p} &= d\left(f(\boldsymbol {x}),f(\boldsymbol {x}^+) \right),\\ d_{a,n} &= d\left(f(\boldsymbol {x}),f(\boldsymbol {x}^-) \right), \end{aligned} \end{equation}
(4)
where \(m\) is the margin parameter, \(d_{a,p}\) and \(d_{a,n}\) are short for anchor-to-positive distance and anchor-to-negative distance, \(d(\cdot)\) is the Euclidean distance, \(\left[\cdot \right]_+\) denotes \(max(\cdot ,0)\), and \(f(\boldsymbol {x})\), \(f(\boldsymbol {x}^+)\), and \(f(\boldsymbol {x}^-)\) are features of sample \(\boldsymbol {x}\), \(\boldsymbol {x}^+\), and \(\boldsymbol {x}^-\), respectively.
The core idea of triplet loss is to optimize the similarity within triplets so that \(d_{a,n}\) is larger than \(d_{a,p}\) by at least the margin \(m\). However, the optimization can still be improved, as shown in Figure 7. To compare traditional triplet loss and HTP, we analyze the empirical distribution of the relative distance between the positive pair and the negative pair, defined as \([m+ d_{a,p}-d_{a,n} ]_+\), from a converged HMBN. The number of samples is displayed on a log axis because easy samples are extremely numerous. With our proposed HTP, the majority of samples move left in Figure 7(b) compared with Figure 7(a), indicating that HTP further reduces intra-class variation and enlarges inter-class variation.
Fig. 7. Empirical distribution of relative distance of a positive pair and negative pair from a converged HMBN model trained with traditional triplet loss and HTP on the DukeMTMC-reID dataset.
To optimize triplets in a harder manner, HTP penalizes large \(d_{a,p}\) and small \(d_{a,n}\) with a polynomial mapping function, defined as follows:
\begin{equation} \begin{aligned} \widetilde{d}_{a,p}=\left(d_{a,p}+1\right)^{(1+\alpha)}-1, \end{aligned} \end{equation}
(5)
\begin{equation} \begin{aligned} \widetilde{d}_{a,n}=\left(d_{a,n}+1\right)^{(1-\alpha)}-1, \end{aligned} \end{equation}
(6)
where \(\alpha\) is the scale factor. We update \(d_{a,p}\) and \(d_{a,n}\) with \(\widetilde{d}_{a,p}\) and \(\widetilde{d}_{a,n}\). The polynomial mapping function is visualized for several values of \(\alpha\) in Figure 8. When \(\alpha =0\), \(\widetilde{d}_{a,p}=d_{a,p}\), and \(\widetilde{d}_{a,n}=d_{a,n}\). The larger \(\alpha\) is, the more penalty \(d_{a,p}\) and \(d_{a,n}\) get.
Fig. 8. Illustration of polynomial mapping function. (a) Polynomial mapping function on \(d_{ap}\). (b) Polynomial mapping function on \(d_{an}\).
Based on the polynomial mapping function, the HTP is defined as follows:
\begin{equation} \begin{aligned}L_{HTP}(f(\boldsymbol {x}), f(\boldsymbol {x}^+), f(\boldsymbol {x}^-))=\left[m+ \widetilde{d}_{a,p} - \widetilde{d}_{a,n}\right]_+. \end{aligned} \end{equation}
(7)
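For concreteness, a minimal PyTorch sketch of HTP is shown below. It is our own rendering of Equations (5)-(7); the default margin of 1.2 and \(\alpha = 0.01\) follow the values reported in Section 4, and the triplet mining step itself is omitted.

```python
import torch
import torch.nn.functional as F


def harder_triplet_loss(anchor, positive, negative, margin=1.2, alpha=0.01):
    """Harder triplet loss sketch (Equations (5)-(7)); alpha = 0 recovers the
    traditional triplet loss. Inputs are (T, D) feature batches of mined triplets."""
    d_ap = F.pairwise_distance(anchor, positive)        # Euclidean d_{a,p}
    d_an = F.pairwise_distance(anchor, negative)        # Euclidean d_{a,n}
    d_ap = (d_ap + 1).pow(1 + alpha) - 1                # up-weight d_{a,p}, Eq. (5)
    d_an = (d_an + 1).pow(1 - alpha) - 1                # down-weight d_{a,n}, Eq. (6)
    return F.relu(margin + d_ap - d_an).mean()          # hinge of Eq. (7), averaged
```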
As shown in Figure 2, global features \(\boldsymbol {u}_i(i=1,2,3)\) are trained with HTP and classification loss. Specifically, HTP on global features can be formulated as
\begin{equation} \begin{aligned}L_{HTP}^{g} = \sum _{i=1}^{N_g}\left(\frac{1}{N_t}\sum _{j=1}^{N_t}L_{HTP}\left({\boldsymbol {u}_i}^{(j)},{\boldsymbol {u}_i}^{(j+)},{\boldsymbol {u}_i}^{(j-)}\right) \right), \end{aligned} \end{equation}
(8)
where \(N_g\) and \(N_t\) are the numbers of global features and sampled triplets, \({\boldsymbol {u}_i}^{(j)}\), \({\boldsymbol {u}_i}^{(j+)}\), \({\boldsymbol {u}_i}^{(j-)}\) are the feature \({\boldsymbol {u}_i}\) extracted from anchor, positive, and negative samples in the \(j\)-th triplet, respectively. Classification loss on global features can be formulated as
\begin{equation} L_{cls}^{g} = \sum _{i=1}^{N_g}\left(-\frac{1}{N}\sum _{j=1}^{N}\log \frac{\exp \left(((\boldsymbol {W}^i)_{y_j})^T\boldsymbol {u}_i\right)}{\sum _{k=1}^{C}\exp \left(((\boldsymbol {W}^i)_{k})^T\boldsymbol {u}_i\right)}\right), \end{equation}
(9)
where \(N\), \(C\) are the number of input images and identities, and \(y_j\) is the ground truth of the \(j\)-th input image. \((\boldsymbol {W}^i)_{k}\) is the \(k\)-th column of the fully connected layer whose input is \(\boldsymbol {u}_i\). Local features \(\boldsymbol {v}^n_m\) are trained only with classification loss. Classification loss on local features is formulated as
\begin{equation} L_{cls}^{l} = \sum _{n=2}^{N_b} \sum _{m=1}^{n}\left(-\frac{1}{N}\sum _{j=1}^{N}\log \frac{\exp \left(((\boldsymbol {W}_m^n)_{y_j})^T\boldsymbol {v}_m^n\right)}{\sum _{k=1}^{C}\exp \left(((\boldsymbol {W}_m^n)_{k})^T\boldsymbol {v}_m^n\right)}\right), \end{equation}
(10)
where \(N_b\) is the number of branches and \((\boldsymbol {W}_m^n)_{k}\) is \(k\)-th column of the fully connected layer whose input is \(\boldsymbol {v}^n_m\). The final loss is defined as follows:
\begin{equation} L=\frac{1}{N_{htp}}L_{HTP}^{g} + \lambda \frac{1}{N_{cls}}\left(L_{cls}^g + L_{cls}^l\right), \end{equation}
(11)
where \(N_{htp}\) and \(N_{cls}\) are the numbers of features trained with HTP and classification loss and \(\lambda\) is the weight of classification loss. Specifically, we set \(\lambda\) to 2 in the following experiments.
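The overall objective can be sketched as follows. This is an illustrative composition of Equations (8)-(11) that reuses the HTP sketch above; the triplet index arguments and tensor layout are assumptions on our part.

```python
import torch
import torch.nn.functional as F


def total_loss(global_feats, global_logits, local_logits, labels,
               anc_idx, pos_idx, neg_idx, lam=2.0):
    """Sketch of Equations (8)-(11): HTP on the global features plus weighted
    classification losses on global and local features. anc_idx/pos_idx/neg_idx
    index the triplets mined within the mini-batch (mining itself is omitted)."""
    # Eq. (8): HTP averaged over triplets, summed over the N_g global features.
    l_htp = sum(harder_triplet_loss(u[anc_idx], u[pos_idx], u[neg_idx])
                for u in global_feats)
    # Eq. (9) / Eq. (10): softmax cross-entropy on every global and local classifier.
    l_cls_g = sum(F.cross_entropy(z, labels) for z in global_logits)
    l_cls_l = sum(F.cross_entropy(z, labels) for z in local_logits)
    # Eq. (11): normalize by the number of features each term covers, weight by lambda = 2.
    n_htp, n_cls = len(global_feats), len(global_logits) + len(local_logits)
    return l_htp / n_htp + lam * (l_cls_g + l_cls_l) / n_cls
```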

3.6 Discussions

This subsection briefly discusses the proposed modules and some similar existing methods, emphasizing the differences between them. Note that our proposed modules and the compared existing methods are designed with different purposes, which means that they can hardly be compared in a fair experimental setting.
Relations between HOP and OBM. The overlapping blocks model (OBM) [4] proposes a multiple-overlapping-blocks structure to pool features from overlapping regions. The OBM requires horizontal partitions at multiple scales. In contrast, HOP operates at a single scale, which makes it lighter to train because it needs relatively fewer fully connected layers.
Relations between IBASM and non-local block. In some ways, the IBASM can be regarded as a variation of the non-local block [51]. The IBASM differs from the non-local block as follows: (1) The IBASM takes two input features while the non-local block takes one input feature. The IBASM performs non-local operations on two features. This modification helps the model refine one input feature with the consideration of the other input feature. (2) The IBASM produces two output features corresponding to two refined input features containing reciprocal information from each other. “Encoder-decoder attention” layers [46] and pair-wise non-local operation [10] both take two input features to compute non-local operations and produce one output feature corresponding to one refined input feature containing the reciprocal information from the other input feature. The IBASM is the first module to build relations between two branches for Re-ID.
Relations between IBAM and PS-MCNN. The IBAM has some similarities with the partially shared multi-task convolutional neural network (PS-MCNN) [1] because both are designed to make branches interact. However, our IBAM is different from a PS-MCNN in three aspects. (1) An IBAM aims to build relations among different branches with various granularities while a PS-MCNN focuses on building relations among different branches with various attribute groups. (2) An IBAM builds interactions among all branches by modeling the relations of paired branches while a PS-MCNN introduces a new Shared Network (SNet) to learn shared information for all branches. In addition, IBAM considers spatial information in the process of interaction, which is ignored by a PS-MCNN. (3) An IBAM is a module that can be easily embedded into any multi-branch network architecture, while a PS-MCNN is a network designed for building interactions among different branches with various attribute groups specifically. Our IBAM is more general than a PS-MCNN.

4 Experiments

In this section, we first describe three datasets and evaluation protocols in our experiments. Then, implementation details are introduced. Next, we compare the retrieval accuracy of the HMBN with state-of-the-art methods on these three datasets. Finally, we carry out ablation studies on DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected to verify the effectiveness of each component. Parameter analysis and visualization are also included.

4.1 Datasets and Evaluation Protocols

We conduct experiments on three popular Re-ID datasets: DukeMTMC-reID [31, 65], CUHK03 [21], and Market-1501 [63]. Dataset statistics are shown in Table 1. \({\bf DukeMTMC-reID}\) is a subset of the DukeMTMC dataset [31] for Re-ID and one of the largest datasets in the Re-ID task. It contains 36,411 images of 1,812 pedestrians captured by 8 high-resolution cameras. A total of 702 pedestrians with 16,522 images are randomly divided into the training set. Another 702 pedestrians form the testing set, in which 2,228 images and 17,661 images constitute the query set and gallery set, respectively. The remaining 408 pedestrians are distractors. \({\bf CUHK03}\) is a relatively small dataset compared with DukeMTMC-reID. It has 1,467 pedestrians with 14,097 images captured by 6 cameras on the CUHK campus. Both manually annotated and DPM-detected bounding boxes are provided, denoted as CUHK03-Labeled and CUHK03-Detected; we use both settings in this article. \({\bf Market-1501}\) is another large dataset, collected with the Deformable Part Model (DPM) detector [8] from 6 cameras. The whole dataset is separated into a training set with 12,936 images of 751 pedestrians and a testing set including 3,368 query images of 750 pedestrians and 15,913 gallery images of 751 pedestrians.
Table 1. Dataset Statistics

                      DukeMTMC-reID   CUHK03      Market-1501
Train (IDs/Images)    702/16,522      767/7,365   751/12,936
Gallery (IDs/Images)  1,110/17,661    700/5,332   751/15,913
Query (IDs/Images)    702/2,228       700/1,400   750/3,368
Cameras               8               6           6
We report Cumulative Matching Characteristics (CMCs) at rank-1, and the mean Average Precision (mAP) with the single-shot setting on all candidate datasets.
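For reference, these metrics can be computed from a query-gallery distance matrix roughly as follows. This is a simplified sketch; the standard removal of same-camera, same-identity gallery entries used in Re-ID evaluation is omitted here.

```python
import numpy as np


def rank1_and_map(dist, query_ids, gallery_ids):
    """Rank-1 CMC and mAP from a (num_query, num_gallery) distance matrix.
    Same-camera filtering of true matches is omitted for brevity."""
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])                       # gallery indices, nearest first
        matches = gallery_ids[order] == qid
        rank1_hits.append(float(matches[0]))
        hit_positions = np.where(matches)[0]
        if len(hit_positions) == 0:                       # no ground-truth match
            continue
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))
```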

4.2 Implementation Details

The proposed HMBN is implemented using the PyTorch [28] framework on a single NVIDIA GTX 1080 Ti GPU. The weights of ResNet-50 [11] pretrained on ImageNet [7] are adopted to initialize the parameters of the HMBN.
In the training phase, we resize the input images to \(384\times {128}\). They are then augmented by random horizontal flipping, normalization, and random erasing [67]. The total training phase takes 500 epochs. The initial learning rate is set to 2e-4 and then decays to 2e-5 and 2e-6 after 320 and 380 epochs, respectively. An Adam optimizer is used to update the weight parameters with a weight decay of 5e-4. The batch size is set to 32, in which each identity contains 4 images. The margin in the HTP is set to 1.2 in the following experiments. It should be emphasized that the current experimental setting differs slightly from the conference version in two ways. (1) Different batch sizes: the batch size is set to 32 in this journal version, while the batch size in the conference version is 16. (2) Different machines: experiments from the conference and journal versions are conducted on two independent machines with different hardware and software, that is, CPUs, GPUs, and operating systems. Because of these differences, the results reported in this article are slightly different from those of the conference version.
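These settings translate into roughly the following PyTorch recipe. This is a sketch of our reading of the text; the ImageNet normalization statistics are an assumption, as the exact values are not stated above, and the identity-balanced batch sampler is omitted.

```python
import torch
from torchvision import transforms

# Training augmentation pipeline matching the reported settings.
train_transform = transforms.Compose([
    transforms.Resize((384, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(),
])


def build_optimizer(model: torch.nn.Module):
    """Adam with weight decay 5e-4; the learning rate starts at 2e-4 and decays to
    2e-5 / 2e-6 after epochs 320 / 380 (500 epochs, batches of 8 identities x 4 images)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[320, 380], gamma=0.1)
    return optimizer, scheduler
```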
In the testing phase, the input images are resized to \(384\times {128}\) and augmented only by normalization. All dimension-reduced (256-dim) global and local features are concatenated as the final embedding vector of a pedestrian.
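At retrieval time this amounts to ranking gallery images by the distance between concatenated embeddings, roughly as sketched below; Euclidean distance is assumed here, matching the distance used in the triplet loss.

```python
import torch


def rank_gallery(query_feats: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """Rank gallery images for each query by Euclidean distance between the
    concatenated 256-dim global and local embeddings."""
    dist = torch.cdist(query_feats, gallery_feats)   # (num_query, num_gallery)
    return dist.argsort(dim=1)                       # ascending: best match first
```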

4.3 Comparison with State-of-the-Art Methods

More than 10 existing state-of-the-art methods are compared with our proposed HMBN on DukeMTMC-reID, CUHK03, and Market-1501 datasets in Table 2, Table 3, and Table 4, respectively. We separate these compared methods into three groups: single-branch methods (S), multi-branch methods (M), and attention-based methods regardless of the number of branches (A).
Table 2. Comparison of HMBN with State-of-the-Art Methods for DukeMTMC-reID Dataset

Group   Methods                          Rank-1   mAP
S       MLFN [2] (CVPR2018)              81.00    62.80
S       PCB+RPP [41] (ECCV2018)          83.30    69.20
S       HPM [9] (AAAI2019)               86.60    74.30
S       OSNet [68] (ICCV2019)            88.60    73.50
S       BoT [26] (CVPRW2019)             86.40    76.40
M       PSE [32] (CVPR2018)              79.80    62.00
M       HA-CNN [23] (CVPR2018)           80.50    63.80
M       C\(A^3\)Net [24] (ACM MM2018)    84.60    70.20
M       CAMA [56] (CVPR2019)             85.80    72.90
M       HOReID [49] (CVPR2020)           86.90    75.60
M       MGN [50] (ACM MM2018)            88.70    78.40
M       PISNet [62] (ECCV2020)           88.80    78.70
A       DuATM [36] (CVPR2018)            81.82    64.58
A       Mancs [47] (ECCV2018)            84.90    71.80
A       AANet-50 [44] (CVPR2019)         86.42    72.56
A       CASN [64] (CVPR2019)             87.70    73.70
        HMBN                             89.86    79.68
        HMBN (RK)                        92.19    90.44

The top results are in bold. “RK” means re-ranking. Groups: S = single-branch, M = multi-branch, A = attention-based.
Table 3. Comparison of HMBN with State-of-the-Art Methods for CUHK03-Labeled and CUHK03-Detected Datasets
Table 4. Comparison of HMBN with State-of-the-Art Methods for Market-1501 Dataset
\({\bf DukeMTMC-reID.}\) Our proposed HMBN achieves the best result of a Rank-1 accuracy of 89.86% and a mAP of 79.68% on the DukeMTMC-reID dataset. We should emphasize the following. (1) The gaps between HMBN and single-branch methods (MLFN [2], PCB+RPP [41], HPM [9], OSNet [68], and BoT [26]) demonstrate the effectiveness of the multi-branch structure, for example, the HMBN surpasses the BoT by Rank-1/mAP = 3.46%/3.28%. (2) Multi-branch methods (PSE [32], HA-CNN [23], C\(A^3\)Net [24], CAMA [56], HOReID [49], MGN [50], and PISNet [62]) integrate complementary information (e.g., pose estimation, human parsing results, attribute information) into final pedestrian representations, for example, HOReID [49] aligns local features with key-points estimation. Without injecting prior knowledge, such as attributes or poses, the HMBN exceeds the MGN and achieves the best results in this group, by 1.16% in Rank-1 accuracy and 1.28% in mAP. We argue that these methods ignore the interaction among branches in the multi-branch network. On the contrary, the HMBN builds interaction among branches in higher layers of the network by injecting the IBAM. (3) The HMBN outperforms the CASN [64], which achieves the top result in attention-based methods (DuATM [36], Mancs [47], AANet-50 [44], CASN [64]), by 2.16% in Rank-1 and 5.98% in mAP. Instead of modeling intra-branch contextual dependency in attention-based methods, our designed IBAM builds inter-branch dependency. With the help of re-ranking [66], we achieve a higher result of 92.19% in Rank-1 accuracy and 90.44% in mAP.
Figure 9 shows the top-10 ranking results with four query images on the DukeMTMC-reID dataset. Given a query image, the HMBN can retrieve the correct pedestrian under severe visual recognition problems such as view angle variations, illumination variations, and occlusion.
Fig. 9. Example results of our HMBN on the DukeMTMC-reID dataset. Given a query image, the top-10 ranking list is presented. Correct and incorrect matches are highlighted green and red, respectively.
\({\bf CUHK03.}\) We report a clear win on both CUHK03-Labeled and CUHK03-Detected in Table 3. The HMBN achieves the top results of 78.07% Rank-1 accuracy and 75.63% mAP on CUHK03-Labeled and 75.43% Rank-1 accuracy and 73.05% mAP on CUHK03-Detected. The HMBN outperforms the CASN [64], the best of the previous existing methods, by 4.37% in Rank-1 accuracy and 7.63% in mAP on CUHK03-Labeled, and by 3.93% in Rank-1 accuracy and 8.65% in mAP on CUHK03-Detected.
\({\bf Market-1501.}\) As illustrated in Table 4, our proposed HMBN achieves competitive results of 94.86% Rank-1 accuracy and 87.45% mAP on Market-1501. Although the MGN and PISNet outperform the HMBN by 0.84% and 0.74% in Rank-1, respectively, the HMBN clearly exceeds all existing methods in terms of mAP (87.45%).

4.4 Ablation Studies

To further verify the effectiveness of each component in the HMBN, we present ablation analysis on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets. Parameter analysis and visualization are performed on the DukeMTMC-reID and CUHK03-Labeled datasets.
\({\bf Multi-branch Structure.}\) The effectiveness of the uniform partition and the multi-branch structure is shown in Table 5 through the comparison of models 1-6. B + S1B is short for a model including the base module and S1B, B + S1B + S2B (\(l\) = 0) is a multi-branch model composed of the base module, S1B, and S2B, and so forth. For the uniform partition in a single branch, the number of horizontal stripes controls the granularity of the local features. As the number of stripes increases, the performance improves as well. However, as the number of stripes increases further, the improvement becomes marginal while the number of model parameters grows: for example, B + S2B (\(l\) = 0) outperforms B + S1B in mAP by 7.93%, but B + S3B (\(l\) = 0) outperforms B + S2B (\(l\) = 0) in mAP by only 1.98% on DukeMTMC-reID. For the multi-branch structure, a model with a multi-branch structure is better than each of its component branches: for example, B + S1B + S2B (\(l\) = 0) + S3B (\(l\) = 0) beats B + S1B, B + S2B (\(l\) = 0), B + S3B (\(l\) = 0), and B + S1B + S2B (\(l\) = 0) in Rank-1/mAP by 6.19%/13.74%, 2.92%/5.81%, 2.02%/3.83%, and 0.76%/2.62% on DukeMTMC-reID. As the number of branches increases further, the performance also increases, but only marginally: B + S1B + S3B (\(l\) = 0) outperforms B + S1B in mAP by 11.78%, but B + S1B + S2B (\(l\) = 0) + S3B (\(l\) = 0) outperforms B + S1B + S2B (\(l\) = 0) in mAP by only 2.62% on DukeMTMC-reID. To keep the balance between high retrieval accuracy and a low parameter count, we adopt B + S1B + S2B (\(l\) = 0) + S3B (\(l\) = 0) as the architecture used to verify the effectiveness of HOP, IBAM, and HTP.
Table 5. Ablation Studies of HMBN on DukeMTMC-reID, CUHK03-Labeled, CUHK03-Detected Datasets
\({\bf Effectiveness of HOP.}\) Figure 10 shows how Rank-1 accuracy and mAP change with the parameter \(l\) in HOP, where \(l\) is the total height of the overlapped areas in one stripe of one branch. To simplify the notation, \(l\) of HOP in S2B is denoted as \(l_2\) and \(l\) of HOP in S3B is denoted as \(l_3\). In the model with the single branch S2B (B + S2B), as illustrated in Figure 10(a), the performance rises as \(l_2\) increases, indicating that HOP is essential. However, the accuracy does not keep growing with \(l_2\). In the model with the single branch S3B (B + S3B), as illustrated in Figure 10(b), the accuracy shows the same trend as \(l_3\) increases. Note that an over-increased \(l\) helps to cover meaningful local regions between adjacent parts but diminishes the feature learning of fine-grained cues. A proper \(l\) achieves a good balance between learning fine-grained information and extracting features in meaningful local regions. As shown in Table 5, \((2, 0)\), \((2, 0)\), and \((2, 2)\) are recommended for \((l_2, l_3)\) on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively.
Fig. 10. Parameter analysis for \(l\) in S2B and S3B. (a) Rank-1 and mAP changes with \(l_2\) on DukeMTMC-reID. (b) Rank-1 and mAP changes with \(l_3\) on DukeMTMC-reID. (c) Rank-1 and mAP changes with \(l_2\) on CUHK03-Labeled. (d) Rank-1 and mAP changes with \(l_3\) on CUHK03-Labeled.
\({\bf Effectiveness of IBAM.}\) The comparison of models 7 and 8 in Table 5 shows the effectiveness of the IBAM. Empirically, the IBAM improves Rank-1/mAP by 0.45%/0.42%, 1.79%/1.26%, and 0.93%/1.73% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 7 vs. model 8). To verify whether the IBAM builds interaction among branches, we visualize the activations of the last convolutional feature maps from the three branches on DukeMTMC-reID and CUHK03-Labeled in Figure 11. We note that (1) S1B performs the Re-ID task at the global level, which is likely to ignore detailed information. The IBAM helps S2B and S3B interact with S1B, so S1B learns global information with the consideration of local information. Red ellipses mark the detailed information ignored by S1B in the HMBN without the IBAM but considered by S1B in the HMBN with the IBAM. (2) S2B and S3B perform the Re-ID task at the part level, which fails to learn consecutive local regions. With the injection of the IBAM, S2B and S3B successfully attend to larger local areas. Yellow ellipses mark the consecutive local regions ignored by S2B or S3B in the HMBN without the IBAM but considered by S2B or S3B in the HMBN with the IBAM.
Fig. 11. Visualization results of activations from the three branches in the HMBN. The top and bottom two input images come from DukeMTMC-reID and CUHK03-Labeled, respectively. For each input image, the activations in the first and second rows are from the HMBN without and with the IBAM, respectively.
\({\bf Effectiveness of Harder Triplet Loss.}\) The HMBN is trained with classification loss and HTP. Traditional triplet loss can be seen as a special case of HTP with \(\alpha =0\). As shown in Table 5, the HTP outperforms traditional triplet loss in Rank-1/mAP by 0.41%/0.04%, 1.78%/2.08%, and 3.14%/2.26% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 8 vs. model 9). Parameter analysis for \(\alpha\) in HTP is illustrated in Figure 12. Parameter \(\alpha\) controls the scale factor applied to the anchor-to-positive and anchor-to-negative distances. We can see that the performance of the HMBN is sensitive to \(\alpha\), and that the HMBN achieves its highest retrieval accuracy when \(\alpha =0.01\).
Fig. 12. Parameter analysis for \(\alpha\) in the HMBN on the DukeMTMC-reID and CUHK03-Labeled datasets.

5 Conclusions

In this article, we focus on the widely used multi-branch methods with different stripes and propose a harmonious multi-branch network for Re-ID with HTP. Unlike previous methods that design more branches or more stripes for extracting coarse-to-fine pyramid representations, we analyze how to improve feature learning in a single branch and build interaction among different branches. For feature learning in a single branch, we design the HOP to enhance representational capability in meaningful local regions while extracting fine-grained information. For the interaction among different branches, we incorporate the IBAM to refine the representation within a single branch by integrating information from other branches. In addition, we analyze the deficiencies of the commonly applied triplet loss and propose a generalized triplet loss, namely, HTP. Our HTP optimizes triplets in a harder manner, further reducing intra-class variation and enlarging inter-class variation. Each component is verified thoroughly in extensive ablation experiments. In addition, the HMBN achieves superior performance compared with state-of-the-art Re-ID methods. In the future, we will explore our idea of harmonious multi-branch learning in more computer vision tasks, such as image retrieval.

References

[1]
Jiajiong Cao, Yingming Li, and Zhongfei Zhang. 2018. Partially shared multi-task convolutional neural network with local constraint for face attribute learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 4290–4299.
[2]
Xiaobin Chang, Timothy M. Hospedales, and Tao Xiang. 2018. Multi-level factorisation net for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2109–2118.
[3]
Guangyi Chen, Jiwen Lu, Ming Yang, and Jie Zhou. 2019. Spatial-temporal attention-aware learning for video-based person re-identification. IEEE Transactions on Image Processing 28, 9 (2019), 4192–4205.
[4]
Yipeng Chen, Cairong Zhao, and Tianli Sun. 2019. Single image based metric learning via overlapping blocks model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Long Beach, USA, 0–0.
[5]
De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. 2016. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Las Vegas, USA, 1335–1344.
[6]
Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2019. Show, control and tell: A framework for generating controllable and grounded captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 8307–8316.
[7]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Miami, USA, 248–255.
[8]
Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. In 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Anchorage, USA, 1–8.
[9]
Yang Fu, Yunchao Wei, Yuqian Zhou, Honghui Shi, Gao Huang, Xinchao Wang, Zhiqiang Yao, and Thomas Huang. 2019. Horizontal pyramid matching for person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. AAAI Press, Honolulu, USA, 8295–8302.
[10]
Zhihang Fu, Yaowu Chen, Hongwei Yong, Rongxin Jiang, Lei Zhang, and Xian-Sheng Hua. 2019. Foreground gating and background refining network for surveillance object detection. IEEE Transactions on Image Processing 28, 12 (2019), 6077–6090.
[11]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Las Vegas, USA, 770–778.
[12]
Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737
[13]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14]
Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. 2019. Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 9317–9326.
[15]
Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. 2019. VRSTC: Occlusion-free video person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 7183–7192.
[16]
Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7132–7141.
[17]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 4700–4708.
[18]
Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E. Kamasak, and Mubarak Shah. 2018. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 1062–1071.
[19]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 25. MIT Press, Stateline, USA, 1097–1105.
[20]
Shuang Li, Slawomir Bak, Peter Carr, and Xiaogang Wang. 2018. Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 369–378.
[21]
Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. 2014. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Columbus, USA, 152–159.
[22]
Wei Li, Xiatian Zhu, and Shaogang Gong. 2017. Person re-identification by deep joint learning of multi-loss classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, San Francisco, USA, 2194–2200.
[23]
Wei Li, Xiatian Zhu, and Shaogang Gong. 2018. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2285–2294.
[24]
Jiawei Liu, Zheng-Jun Zha, Hongtao Xie, Zhiwei Xiong, and Yongdong Zhang. 2018. CA3Net: Contextual-attentional attribute-appearance network for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, Seoul, South Korea, 737–745.
[25]
Xiang Long, Chuang Gan, Gerard De Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. 2018. Attention clusters: Purely attention based local feature integration for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7834–7843.
[26]
Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. 2019. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Long Beach, USA, 0–0.
[27]
Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. 2018. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 5657–5666.
[28]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703
[29]
Yuxin Peng, Yunzhen Zhao, and Junchao Zhang. 2018. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology 29, 3 (2018), 773–786.
[30]
Xuelin Qian, Yanwei Fu, Tao Xiang, Wenxuan Wang, Jie Qiu, Yang Wu, Yu-Gang Jiang, and Xiangyang Xue. 2018. Pose-normalized image generation for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, Munich, Germany, 650–667.
[31]
Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 17–35.
[32]
M. Saquib Sarfraz, Arne Schumann, Andreas Eberle, and Rainer Stiefelhagen. 2018. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 420–429.
[33]
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Boston, USA, 815–823.
[34]
Yantao Shen, Tong Xiao, Hongsheng Li, Shuai Yi, and Xiaogang Wang. 2018. End-to-end deep Kronecker-product matching for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 6886–6895.
[35]
Hailin Shi, Yang Yang, Xiangyu Zhu, Shengcai Liao, Zhen Lei, Weishi Zheng, and Stan Z. Li. 2016. Embedding deep metric for person re-identification: A study against large variations. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 732–748.
[36]
Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 5363–5372.
[37]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
[38]
Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. 2018. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 1179–1188.
[39]
Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. 2017. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Venice, Italy, 3960–3969.
[40]
Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. 2018. Part-aligned bilinear representations for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 402–419.
[41]
Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 480–496.
[42]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Boston, USA, 1–9.
[43]
Zengming Tang and Jun Huang. 2020. Branch interaction network for person re-identification. In Proceedings of the Asian Conference on Computer Vision (ACCV’20). Springer, Virtual Kyoto, Japan, 322–337.
[44]
Chiat-Pin Tay, Sharmili Roy, and Kim-Hui Yap. 2019. AANet: Attribute attention network for person re-identifications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 7134–7143.
[45]
Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang. 2016. A Siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision. Springer, Amsterdam, the Netherlands, 135–153.
[46]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Long Beach, USA, 5998–6008.
[47]
Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. 2018. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 365–381.
[48]
Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. 2017. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 3156–3164.
[49]
Guan’an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. 2020. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Virtual Seattle, USA, 6449–6458.
[50]
Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. 2018. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia. ACM, Seoul, South Korea, 274–282.
[51]
Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 7794–7803.
[52]
Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q. Weinberger. 2018. Resource aware person re-identification across multiple resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 8042–8051.
[53]
Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer, Munich, Germany, 3–19.
[54]
Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. 2018. Attention-aware compositional network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, USA, 2119–2128.
[55]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. PMLR, Lille, France, 2048–2057.
[56]
Wenjie Yang, Houjing Huang, Zhang Zhang, Xiaotang Chen, Kaiqi Huang, and Shu Zhang. 2019. Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 1389–1398.
[57]
Hantao Yao, Shiliang Zhang, Dongming Zhang, Yongdong Zhang, Jintao Li, Yu Wang, and Qi Tian. 2017. Large-scale person re-identification as retrieval. In IEEE International Conference on Multimedia and Expo (ICME’17). IEEE, Hong Kong, China, 1440–1445.
[58]
Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. 2014. Deep metric learning for person re-identification. In 22nd International Conference on Pattern Recognition. IEEE, Stockholm, Sweden, 34–39.
[59]
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019. Self-attention generative adversarial networks. In International Conference on Machine Learning. PMLR, Long Beach, USA, 7354–7363.
[60]
Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. 2020. ResNeSt: Split-attention networks. arXiv:2004.08955
[61]
Shun Zhang, Jia-Bin Huang, Jongwoo Lim, Yihong Gong, Jinjun Wang, Narendra Ahuja, and Ming-Hsuan Yang. 2020. Tracking persons-of-interest via unsupervised representation adaptation. International Journal of Computer Vision 128, 1 (2020), 96–120.
[62]
Shizhen Zhao, Changxin Gao, Jun Zhang, Hao Cheng, Chuchu Han, Xinyang Jiang, Xiaowei Guo, Wei-Shi Zheng, Nong Sang, and Xing Sun. 2020. Do not disturb me: Person re-identification under the interference of other pedestrians. In European Conference on Computer Vision. Springer, Virtual Glasgow, UK, 647–663.
[63]
Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Santiago, Chile, 1116–1124.
[64]
Meng Zheng, Srikrishna Karanam, Ziyan Wu, and Richard J. Radke. 2019. Re-identification with consistent attentive Siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Long Beach, USA, 5735–5744.
[65]
Zhedong Zheng, Liang Zheng, and Yi Yang. 2017. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Venice, Italy, 3754–3762.
[66]
Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. 2017. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 1318–1327.
[67]
Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, New York, USA, 13001–13008.
[68]
Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. 2019. Omni-scale feature learning for person re-identification. arXiv:1905.00953
[69]
Sanping Zhou, Fei Wang, Zeyi Huang, and Jinjun Wang. 2019. Discriminative feature learning with consistent attention regularization for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, Seoul, South Korea, 8040–8049.
[70]
Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. 2017. Point to set similarity based deep feature learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Honolulu, USA, 3741–3750.
