In this section, we first describe the three datasets and evaluation protocols used in our experiments. Then, implementation details are introduced. Next, we compare the retrieval accuracy of the HMBN with that of state-of-the-art methods on these three datasets. Finally, we carry out ablation studies on DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected to verify the effectiveness of each component. Parameter analysis and visualization are also included.
4.1 Datasets and Evaluation Protocols
We conduct experiments on three popular Re-ID datasets: DukeMTMC-reID [31, 65], CUHK03 [21], and Market-1501 [63]. Dataset statistics are shown in Table 1.
\({\bf DukeMTMC-reID}\) is a subset of the DukeMTMC dataset [31] for Re-ID and one of the largest datasets for this task. It contains 36,411 images of 1,812 pedestrians captured by 8 high-resolution cameras. A total of 702 pedestrians with 16,522 images are randomly selected for the training set. Another 702 pedestrians form the testing set, in which 2,228 images and 17,661 images constitute the query set and gallery set, respectively. The remaining 408 pedestrians are distractors.
\({\bf CUHK03}\) is a relatively small dataset compared with DukeMTMC-reID. It has 1,467 pedestrians with 14,097 images captured by 6 cameras on the CUHK campus. Both manually annotated and DPM-detected bounding boxes are provided, denoted as CUHK03-Labeled and CUHK03-Detected. In this article, we report results on both settings.
\({\bf Market-1501}\) is another large dataset, collected by the Deformable Part Model (DPM) detector [8] from 6 cameras. The whole dataset is split into a training set with 12,936 images of 751 pedestrians and a testing set with 3,368 query images of 750 pedestrians and 15,913 gallery images of 751 pedestrians.
We report the Cumulative Matching Characteristic (CMC) at Rank-1 and the mean Average Precision (mAP) under the single-shot setting on all three datasets.
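To make the protocol concrete, below is a minimal sketch of how Rank-1 and mAP can be computed from a query-by-gallery distance matrix. It is a simplified illustration: the official protocols additionally filter out gallery images that share both identity and camera with the query, which is omitted here.

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """Rank-1 accuracy and mAP from a (num_query, num_gallery) distance
    matrix; a smaller distance means a better match."""
    num_q = dist.shape[0]
    rank1_hits, aps = 0, []
    for i in range(num_q):
        order = np.argsort(dist[i])            # gallery sorted by distance
        matches = g_ids[order] == q_ids[i]     # boolean relevance vector
        if not matches.any():
            continue                           # no true match in gallery
        rank1_hits += int(matches[0])
        hit_ranks = np.where(matches)[0]       # 0-based ranks of true matches
        precisions = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        aps.append(precisions.mean())          # average precision per query
    return rank1_hits / num_q, float(np.mean(aps))
```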
4.2 Implementation Details
The proposed HMBN is implemented in the PyTorch [28] framework on a single NVIDIA GTX 1080 Ti GPU. The weights of ResNet-50 [11] pretrained on ImageNet [7] are adopted to initialize the parameters of the HMBN.
In the training phase, we resize the input images to \(384\times {128}\). They are then augmented by random horizontal flipping, normalization, and random erasing [67]. The full training phase takes 500 epochs. The initial learning rate is set to 2e-4 and decays to 2e-5 and 2e-6 after 320 and 380 epochs, respectively. The Adam optimizer is used to update the weight parameters with a weight decay of 5e-4. The batch size is set to 32, with 4 images per identity. The margin in the HTP is set to 1.2 in the following experiments. It should be emphasized that the current experimental setting differs from the conference version in two ways. (1) Different batch sizes: the batch size is 32 in this journal version, whereas it is 16 in the conference version. (2) Different machines: experiments in the conference and journal versions are conducted on two independent machines with different hardware and software, that is, different CPUs, GPUs, and operating systems. Owing to these differences, the results reported in this article differ slightly from the conference version.
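For reference, the training setup described above can be expressed as the following PyTorch configuration sketch. The HMBN heads are omitted (the backbone alone stands in for the model), so this only illustrates the stated hyperparameters, not the full architecture.

```python
import torch
from torchvision.models import resnet50

# ImageNet-pretrained ResNet-50 backbone; the HMBN branch heads are omitted.
model = resnet50(pretrained=True)  # newer torchvision: weights="IMAGENET1K_V1"

# Adam with weight decay 5e-4; lr 2e-4 -> 2e-5 after epoch 320 -> 2e-6
# after epoch 380, i.e., a MultiStepLR schedule with gamma = 0.1.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[320, 380], gamma=0.1)

for epoch in range(500):
    # ... one pass over batches of 32 images (8 identities x 4 images each),
    #     optimizing the classification loss and the HTP ...
    scheduler.step()
```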
In the testing phase, the input images are resized to \(384\times {128}\) and augmented only by normalization. All dimension-reduced (256-dim) global and local features are concatenated as the final embedding vector of a pedestrian.
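As a small illustration of the test-time descriptor, the sketch below concatenates a hypothetical count of eight 256-dim feature vectors per image; the actual number of global and local features depends on the branch configuration.

```python
import torch

batch = 16
# Hypothetical: eight dimension-reduced (256-dim) global/local features.
feats = [torch.randn(batch, 256) for _ in range(8)]
embedding = torch.cat(feats, dim=1)  # (batch, 8 * 256) final descriptor
```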
4.3 Comparison with State-of-the-Art Methods
More than 10 existing state-of-the-art methods are compared with our proposed HMBN on the DukeMTMC-reID, CUHK03, and Market-1501 datasets in Table 2, Table 3, and Table 4, respectively. We separate the compared methods into three groups: single-branch methods (S), multi-branch methods (M), and attention-based methods regardless of the number of branches (A).
\({\bf DukeMTMC-reID.}\) Our proposed HMBN achieves the best result, with a Rank-1 accuracy of 89.86% and a mAP of 79.68%, on the DukeMTMC-reID dataset. We emphasize the following. (1) The gaps between the HMBN and single-branch methods (MLFN [2], PCB+RPP [41], HPM [9], OSNet [68], and BoT [26]) demonstrate the effectiveness of the multi-branch structure; for example, the HMBN surpasses BoT by Rank-1/mAP = 3.46%/3.28%. (2) Multi-branch methods (PSE [32], HA-CNN [23], C\(A^3\)Net [24], CAMA [56], HOReID [49], MGN [50], and PISNet [62]) integrate complementary information (e.g., pose estimation, human parsing results, attribute information) into the final pedestrian representations; for example, HOReID [49] aligns local features with key-point estimation. Without injecting prior knowledge such as attributes or poses, the HMBN exceeds MGN, which achieves the best results in this group, by 1.16% in Rank-1 accuracy and 1.28% in mAP. We argue that these methods ignore the interaction among branches in the multi-branch network. In contrast, the HMBN builds interaction among branches in higher layers of the network by injecting the IBAM. (3) The HMBN outperforms CASN [64], which achieves the top result among attention-based methods (DuATM [36], Mancs [47], AANet-50 [44], and CASN [64]), by 2.16% in Rank-1 and 5.98% in mAP. Instead of modeling intra-branch contextual dependency as in attention-based methods, our designed IBAM builds inter-branch dependency. With the help of re-ranking [66], we achieve a higher result of 92.19% in Rank-1 accuracy and 90.44% in mAP.
Figure 9 shows the top-10 ranking results for four query images on the DukeMTMC-reID dataset. Given a query image, the HMBN can retrieve the correct pedestrian under severe visual recognition problems such as view angle variations, illumination variations, and occlusion.
\({\bf CUHK03.}\) We report a clear winner case on CUHK03-Labeled and CUHK03-Detected in Table 3. The HMBN achieves the top results of 78.07% Rank-1 accuracy and 75.63% mAP on CUHK03-Labeled and 75.43% Rank-1 accuracy and 73.05% mAP on CUHK03-Detected. The HMBN outperforms CASN [64], the best of all previously existing methods, by 4.37% in Rank-1 accuracy and 7.63% in mAP on CUHK03-Labeled, and by 3.93% in Rank-1 and 8.65% in mAP on CUHK03-Detected.
\({\bf Market-1501.}\) As illustrated in Table 4, our proposed HMBN achieves competitive results of 94.86% Rank-1 accuracy and 87.45% mAP on Market-1501. Although MGN and PISNet outperform the HMBN by 0.84% and 0.74% in Rank-1, respectively, the HMBN clearly exceeds all existing methods in terms of mAP (87.45%).
4.4 Ablation Studies
To further verify the effectiveness of each component in the HMBN, we present an ablation analysis on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets. Parameter analysis and visualization are performed on the DukeMTMC-reID and CUHK03-Labeled datasets.
\({\bf Multi-branch Structure.}\) The effectiveness of uniform partition and the multi-branch structure is shown in Table 5 by comparing models 1-6. B + S1B is short for a model comprising a base module and S1B; B + S1B + S2B (\(l=0\)) is a multi-branch model composed of a base module, S1B, and S2B; and so forth. For uniform partition in a single branch, the number of horizontal stripes controls the granularity of the local features. As the number of stripes increases, the performance improves as well. However, as the number of stripes increases further, the improvement becomes marginal while the model parameters grow: for example, B + S2B (\(l=0\)) outperforms B + S1B in mAP by 7.93%, but B + S3B (\(l=0\)) outperforms B + S2B (\(l=0\)) in mAP by only 1.98% on DukeMTMC-reID. For the multi-branch structure, a model with multiple branches is better than each of its composing branches: for example, B + S1B + S2B (\(l=0\)) + S3B (\(l=0\)) beats B + S1B, B + S2B (\(l=0\)), B + S3B (\(l=0\)), and B + S1B + S2B (\(l=0\)) in Rank-1/mAP by 6.19%/13.74%, 2.92%/5.81%, 2.02%/3.83%, and 0.76%/2.62% on DukeMTMC-reID. As the number of branches increases further, the performance also increases, but only marginally: B + S1B + S3B (\(l=0\)) outperforms B + S1B in mAP by 11.78%, but B + S1B + S2B (\(l=0\)) + S3B (\(l=0\)) outperforms B + S1B + S2B (\(l=0\)) in mAP by only 2.62% on DukeMTMC-reID. To balance high retrieval accuracy against a low parameter count, we adopt B + S1B + S2B (\(l=0\)) + S3B (\(l=0\)) as the architecture to verify the effectiveness of the HOP, IBAM, and HTP.
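For intuition, a minimal sketch of uniform partition in a single branch is given below, assuming max pooling over each stripe of the last convolutional feature map; the choice of max pooling here is an assumption of this illustration.

```python
import torch
import torch.nn.functional as F

def uniform_stripe_pool(feat, num_stripes):
    """Split a feature map into equal horizontal stripes and pool each
    stripe into one part-level vector (assumes h % num_stripes == 0)."""
    n, c, h, w = feat.shape
    stripe_h = h // num_stripes
    parts = []
    for i in range(num_stripes):
        stripe = feat[:, :, i * stripe_h:(i + 1) * stripe_h, :]
        parts.append(F.adaptive_max_pool2d(stripe, 1).flatten(1))  # (n, c)
    return parts

# Example: a stage-4 feature map split into 3 stripes, as in S3B.
parts = uniform_stripe_pool(torch.randn(2, 2048, 24, 8), num_stripes=3)
```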
\({\bf Effectiveness of HOP.}\) Figure 10 shows how the Rank-1 accuracy and mAP change with the parameter \(l\) in HOP, where \(l\) is the total height of the overlapped areas in one stripe of one branch. To simplify the notation, \(l\) of the HOP in S2B is denoted as \(l_2\) and \(l\) of the HOP in S3B is denoted as \(l_3\). In the model with the single branch S2B (B + S2B), as illustrated in Figure 10(a), the performance rises as \(l_2\) increases, indicating that HOP is essential. However, the accuracy does not always grow with \(l_2\). In the model with the single branch S3B (B + S3B), as illustrated in Figure 10(b), the accuracy shows the same trend as \(l_3\) increases. Note that increasing \(l\) helps to cover meaningful local regions between adjacent parts, but an over-increased \(l\) diminishes the learning of fine-grained cues. A proper \(l\) achieves a good balance between learning fine-grained information and extracting features from meaningful local regions. As shown in Table 5, \((2, 0)\), \((2, 0)\), and \((2, 2)\) are recommended for \((l_2, l_3)\) on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively.
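A sketch of the idea behind HOP follows, under the assumption that each uniform stripe is extended by a total of \(l\) rows (split evenly above and below, clipped at the feature-map border) before pooling; the exact placement of the overlap is an assumption of this illustration.

```python
import torch
import torch.nn.functional as F

def hop(feat, num_stripes, l):
    """Horizontal Overlapped Pooling sketch: each uniform stripe is grown
    by l rows in total so adjacent stripes share overlapped regions;
    l = 0 recovers plain uniform partition."""
    n, c, h, w = feat.shape
    stripe_h = h // num_stripes
    parts = []
    for i in range(num_stripes):
        top = max(0, i * stripe_h - l // 2)               # extend upward
        bottom = min(h, (i + 1) * stripe_h + l - l // 2)  # extend downward
        stripe = feat[:, :, top:bottom, :]
        parts.append(F.adaptive_max_pool2d(stripe, 1).flatten(1))
    return parts
```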
\({\bf Effectiveness of IBAM.}\) The comparison of models 7 and 8 in Table 5 shows the effectiveness of the IBAM. Empirically, the IBAM improves Rank-1/mAP by 0.45%/0.42%, 1.79%/1.26%, and 0.93%/1.73% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 7 vs. model 8). To verify whether the IBAM builds interaction among branches, we visualize the activations of the last convolutional feature maps from the three branches on DukeMTMC-reID and CUHK03-Labeled in Figure 11. Note that (1) S1B performs the Re-ID task at the global level, which is likely to ignore detailed information. The IBAM helps S2B and S3B interact with S1B, so that S1B learns global information while taking local information into consideration. Red ellipses mark the detailed information ignored by S1B in the HMBN without the IBAM but considered by S1B in the HMBN with the IBAM. (2) S2B and S3B perform the Re-ID task at the part level, which fails to learn consecutive local regions. With the injection of the IBAM, S2B and S3B successfully attend to larger local areas. Yellow ellipses mark the consecutive local regions ignored by S2B or S3B in the HMBN without the IBAM but considered by S2B or S3B in the HMBN with the IBAM.
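Since the IBAM itself is defined earlier in the article, we only sketch the general idea of inter-branch attention here: a non-local-style cross-attention in which one branch's feature map is updated by attending to another branch's. This is an illustrative stand-in, not the exact IBAM architecture.

```python
import torch
import torch.nn as nn

class CrossBranchAttention(nn.Module):
    """Illustrative inter-branch attention: branch A queries branch B and
    receives a residual update, so the two branches interact."""
    def __init__(self, c, c_inner=256):
        super().__init__()
        self.q = nn.Conv2d(c, c_inner, 1)
        self.k = nn.Conv2d(c, c_inner, 1)
        self.v = nn.Conv2d(c, c, 1)

    def forward(self, x_a, x_b):
        n, c, h, w = x_a.shape
        q = self.q(x_a).flatten(2).transpose(1, 2)   # (n, hw, c')
        k = self.k(x_b).flatten(2)                   # (n, c', hw)
        v = self.v(x_b).flatten(2).transpose(1, 2)   # (n, hw, c)
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return x_a + out                             # residual interaction
```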
\({\bf Effectiveness of Harder Triplet Loss.}\) The HMBN is trained with the classification loss and the HTP. The traditional triplet loss can be seen as a special case of the HTP with \(\alpha =0\). As shown in Table 5, the HTP outperforms the traditional triplet loss in Rank-1/mAP by 0.41%/0.04%, 1.78%/2.08%, and 3.14%/2.26% on the DukeMTMC-reID, CUHK03-Labeled, and CUHK03-Detected datasets, respectively (model 8 vs. model 9). A parameter analysis for \(\alpha\) in the HTP is illustrated in Figure 12. The parameter \(\alpha\) controls the scale factor of the anchor-to-positive distance and the anchor-to-negative distance. We can see that the performance of the HMBN is sensitive to \(\alpha\), and the HMBN achieves a higher retrieval accuracy when \(\alpha =0.01\).
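For concreteness, a batch-hard sketch of such a loss is shown below. The precise way \(\alpha\) rescales the two distances is an assumption of this illustration (hardest-positive distance scaled by \(1+\alpha\), hardest-negative by \(1-\alpha\)); it matches the text in that \(\alpha =0\) recovers the traditional triplet loss and the margin is 1.2.

```python
import torch

def harder_triplet_loss(feats, labels, margin=1.2, alpha=0.01):
    """Batch-hard triplet loss with a scale factor alpha on the
    anchor-to-positive and anchor-to-negative distances; alpha = 0
    reduces to the traditional (batch-hard) triplet loss."""
    dist = torch.cdist(feats, feats)                    # pairwise Euclidean
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # identity mask
    d_ap = dist.masked_fill(~same, 0).max(dim=1).values            # hardest positive
    d_an = dist.masked_fill(same, float('inf')).min(dim=1).values  # hardest negative
    loss = torch.relu((1 + alpha) * d_ap - (1 - alpha) * d_an + margin)
    return loss.mean()
```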