CAT: Centerness-Aware Anchor-Free Tracker
Figure 1. Flowchart of the proposed CAT tracker. The backbone Siamese network takes the exemplar image Z and the search image X as input and outputs the corresponding feature maps φ(Z) and φ(X). To embed the features from the two branches, a depth-wise cross-correlation operation produces the multi-channel response map P. To reduce computation, a convolution layer with a 1×1 kernel fuses the response map; the fused, dimension-reduced response map R serves as the input to the centerness-aware anchor-free network. For every spatial location on the regression map D, the regression branch learns to estimate the distances from that location to the four sides of the ground-truth bounding box. For the classification map C, observing that many low-quality predictions arise at locations far from the target center, the centerness-aware classification branch learns to output 0 for the background and a value between 0 and 1 indicating the normalized distance between a foreground location and the target center, thereby suppressing low-quality predictions.
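The data flow described in the caption can be sketched with a minimal NumPy implementation of depth-wise cross-correlation followed by 1×1 fusion. This is a simplified illustration, not the paper's exact configuration: the channel counts, loop-based correlation, and fusion weights here are assumptions for clarity.

```python
import numpy as np

def depthwise_xcorr(z, x):
    """Depth-wise cross-correlation: each channel of the exemplar
    features z (C x Hz x Wz) slides over the matching channel of the
    search features x (C x Hx x Wx), giving one response map per channel."""
    c, hz, wz = z.shape
    _, hx, wx = x.shape
    ho, wo = hx - hz + 1, wx - wz + 1
    p = np.empty((c, ho, wo))
    for ch in range(c):
        for i in range(ho):
            for j in range(wo):
                p[ch, i, j] = np.sum(x[ch, i:i + hz, j:j + wz] * z[ch])
    return p

def fuse_1x1(p, w):
    """1x1 convolution across channels: a per-location linear map that
    reduces the C-channel response map P to a lower-dimensional map R."""
    return np.tensordot(w, p, axes=([1], [0]))  # (C_out, C) . (C, H, W)
```

For example, with 256-channel backbone features, a (64, 256) weight matrix in `fuse_1x1` would reduce P to a 64-channel map R fed to the anchor-free head (these channel counts are illustrative assumptions).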
Figure 2. Qualitative comparisons between the proposed CAT tracker and the representative trackers SiamRPN [8], SiamRPN++ [17], and SiamCAR [11] on the boat3 (first row), truck1 (second row), bike1 (third row), wakeboard5 (fourth row), and car8 (bottom row) sequences, which involve large scale variations and aspect ratio variations. Compared with the other trackers, CAT provides accurate state estimations even in challenging scenarios including occlusion, large scale variations, and aspect ratio variations, significantly improving tracking robustness and accuracy.
Abstract
1. Introduction
2. Related Work
2.1. Siamese Network-Based Trackers
2.2. Anchor-Free Mechanism
3. Proposed Algorithm
3.1. Feature Extraction
3.2. Feature Combination
3.3. Centerness-Aware Anchor-Free Network
3.4. Loss Function
3.5. Relationship to Prior Anchor-Free Work
4. Experiments
4.1. Implementation Details
4.1.1. Training
4.1.2. Inference
4.2. Datasets and Evaluation Metrics
4.3. State-of-the-Art Comparison
- SiamFC [5] employs a multi-scale searching scheme, performing template matching at multiple scales to identify the best one.
- SiamCAR [11] addresses the tracking problem in an anchor-free approach.
- SiamMask [37] represents the target object as a binary segmentation mask instead of axis-aligned bounding boxes.
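SiamFC's multi-scale searching scheme, mentioned above, can be illustrated with a small scale-selection sketch. The 0.97 damping penalty on non-unit scales is an assumption modeled on SiamFC-style scale penalties, not a value taken from this paper:

```python
def best_scale(responses, scales, penalty=0.97):
    """Pick the scale whose response peak is highest, mildly penalizing
    non-unit scales to discourage scale jitter between frames.
    `responses` is a list of flattened response maps, one per scale."""
    best, best_score = None, float("-inf")
    for r, s in zip(responses, scales):
        score = max(r) * (1.0 if s == 1.0 else penalty)
        if score > best_score:
            best, best_score = s, score
    return best
```

This per-frame exhaustive search over a scale pyramid is what the anchor-free regression branch of CAT avoids: box extents are predicted directly rather than selected from resampled search regions.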
4.4. Component-Wise Analysis of the Proposed Method
- CAT_wo_cen: The CAT tracker without the centerness-aware classification branch is denoted as CAT_wo_cen, where a classification branch trained for binary foreground-background identification is employed instead of our proposed centerness-aware classification branch.
- CAT_w_cen_div: The CAT tracker with a separate branch trained for centerness value estimation is denoted as CAT_w_cen_div, in which a single-layer branch paralleling the binary classification branch is trained to estimate the centerness value.
- CAT_wo_mod: The CAT tracker with the proposed centerness-aware classification branch trained by the standard cross-entropy loss without the modulating factor is denoted as CAT_wo_mod.
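The ablation variants above isolate two ingredients: the centerness-style classification target and the modulating factor on the cross-entropy loss. The following is a hedged sketch assuming an FCOS-style centerness target and a focal-style modulating factor in the spirit of Lin et al. [10]; the paper's exact definitions may differ.

```python
import math

def centerness_target(l, t, r, b):
    """FCOS-style centerness from the distances (l, t, r, b) to the
    four box sides: 1.0 at the box center, decaying toward the edges.
    Background locations are labeled 0. An illustrative stand-in for
    the paper's normalized center-distance target."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def modulated_bce(p, y, gamma=2.0):
    """Binary cross-entropy scaled by a focal-style modulating factor
    |y - p|**gamma that down-weights well-predicted locations.
    Dropping the factor corresponds to the CAT_wo_mod baseline."""
    eps = 1e-12
    ce = -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))
    return abs(y - p) ** gamma * ce
```

Under these assumptions, a prediction far from its soft target incurs a large, nearly unmodulated loss, while an already-accurate prediction is strongly down-weighted, focusing training on hard locations.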
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Li, H.; Xiezhang, T.; Yang, C.; Deng, L.; Yi, P. Secure Video Surveillance Framework in Smart City. Sensors 2021, 21, 4419. [Google Scholar] [CrossRef] [PubMed]
- Baxter, R.H.; Leach, M.J.V.; Mukherjee, S.S.; Robertson, N.M. An Adaptive Motion Model for Person Tracking with Instantaneous Head-pose Features. IEEE Signal Process. Lett. 2015, 22, 578–582. [Google Scholar] [CrossRef] [Green Version]
- Liu, X.; Lin, Z.; Acton, S.T. A Grid-based Bayesian Approach to Robust Visual Tracking. Digit. Signal Process. 2012, 22, 54–65. [Google Scholar] [CrossRef]
- Ma, H.; Acton, S.T.; Lin, Z. OSLO: Automatic Cell Counting and Segmentation for Oligodendrocyte Progenitor Cells. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 2431–2435. [Google Scholar]
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-convolutional Siamese Networks for Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 850–865. [Google Scholar]
- Ma, H.; Acton, S.T.; Lin, Z. SITUP: Scale Invariant Tracking using Average Peak-to-correlation Energy. IEEE Trans. Image Process. 2020, 29, 3546–3557. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Ma, H.; Lin, Z.; Acton, S.T. FAST: Fast and Accurate Scale Estimation for Tracking. IEEE Signal Process. Lett. 2019, 27, 161–165. [Google Scholar] [CrossRef]
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2016; pp. 91–99. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277. [Google Scholar]
- Guo, Q.; Feng, W.; Zhou, C.; Huang, R.; Wan, L.; Wang, S. Learning Dynamic Siamese Network for Visual Object Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1763–1771. [Google Scholar]
- Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4854–4863. [Google Scholar]
- He, A.; Luo, C.; Tian, X.; Zeng, W. A Twofold Siamese Network for Real-time Object Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4834–4843. [Google Scholar]
- Wang, G.; Luo, C.; Xiong, Z.; Zeng, W. SPM-tracker: Series-parallel Matching for Real-time Visual Object Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3643–3652. [Google Scholar]
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware Siamese Networks for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
- Yang, Z.; Xu, Y.; Xue, H.; Zhang, Z.; Urtasun, R.; Wang, L.; Lin, S.; Hu, H. Dense RepPoints: Representing Visual Objects with Dense Point Sets. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 227–244. [Google Scholar]
- Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
- Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up Object Detection by Grouping Extreme and Center Points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 850–859. [Google Scholar]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
- Zhu, C.; He, Y.; Savvides, M. Feature Selective Anchor-free Module for Single-shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
- Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. FoveaBox: Beyound Anchor-Based Object Detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
- Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. VarifocalNet: An IoU-Aware Dense Object Detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8514–8523. [Google Scholar]
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 771–787. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An Advanced Object Detection Network. In Proceedings of the ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 516–520. [Google Scholar]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
- de Boer, P.T.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A Tutorial on the Cross-entropy Method. Ann. Oper. Res. 2005, 134, 19–67. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Real, E.; Shlens, J.; Mazzocchi, S.; Pan, X.; Vanhoucke, V. YouTube-BoundingBoxes: A Large High-precision Human-annotated Dataset for Object Detection in Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5296–5305. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Li, F. ImageNet: A Large-scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Huang, L.; Zhao, X.; Huang, K. GOT-10k: A Large High-diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Kristan, M.; Matas, J.; Leonardis, A.; Felsberg, M.; Pflugfelder, R.; Kamarainen, J.K.; Cehovin Zajc, L.; Drbohlav, O.; Lukezic, A.; Berg, A.; et al. The Seventh Visual Object Tracking VOT2019 Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Mueller, M.; Smith, N.; Ghanem, B. A Benchmark and Simulator for UAV Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 445–461. [Google Scholar]
- Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H.S. Fast Online Object Tracking and Segmentation: A Unifying Approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1328–1338. [Google Scholar]
| | SiamFC | SiamRPN | SiamMask | SiamRPN++ | SiamCAR | CAT |
|---|---|---|---|---|---|---|
| AUC ↑ | 0.485 | 0.557 | 0.603 | 0.610 | 0.614 | 0.635 |
| Prec. ↑ | 0.693 | 0.768 | 0.795 | 0.803 | 0.813 | 0.838 |
| | SiamFC | SiamRPN | SiamMask | SiamRPN++ | SiamCAR | CAT |
|---|---|---|---|---|---|---|
| EAO ↑ | 0.189 | 0.272 | 0.287 | 0.285 | 0.288 | 0.317 |
| A ↑ | 0.510 | 0.582 | 0.592 | 0.599 | 0.593 | 0.583 |
| R ↓ | 0.958 | 0.527 | 0.461 | 0.482 | 0.451 | 0.416 |
| | SiamCAR | CAT |
|---|---|---|
| Number of Parameters ↓ | 51,384,903 | 51,380,293 |
| Speed (Frames Per Second) ↑ | 54.62 | 57.83 |
| | CAT_wo_cen | CAT_w_cen_div | CAT_wo_mod | CAT |
|---|---|---|---|---|
| AUC ↑ | 0.480 | 0.618 | 0.595 | 0.635 |
| Prec. ↑ | 0.646 | 0.805 | 0.788 | 0.838 |
| | CAT_wo_cen | CAT_w_cen_div | CAT_wo_mod | CAT |
|---|---|---|---|---|
| EAO ↑ | 0.224 | 0.291 | 0.266 | 0.317 |
| A ↑ | 0.475 | 0.590 | 0.580 | 0.583 |
| R ↓ | 0.482 | 0.446 | 0.547 | 0.416 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ma, H.; Acton, S.T.; Lin, Z. CAT: Centerness-Aware Anchor-Free Tracker. Sensors 2022, 22, 354. https://doi.org/10.3390/s22010354