Self-Prompting Tracking: A Fast and Efficient Tracking Pipeline for UAV Videos
Figure 1. Comparison of tracking pipelines of Transformer-based trackers. (a) The conventional tracking pipeline. (b) The prompting tracking pipeline (CVPR 2023). (c) Our proposed tracking pipeline with a self-prompting mechanism.
Figure 2. Overview of SiamPT with its CNN backbone, Transformer neck, and tracker head stages. An efficient ConvNeXt backbone forms the foundation of our tracker. Within the neck, the Prompter Generation Module (PGM) extracts prompters from the global attention mechanism, while the Feature Division Module (FDM) categorizes tokens into different classes so the tracker can distinguish targets from background interference.
Figure 3. Overview of the Transformer neck, detailing the proposed FDM and PGM. Different colors denote different clustering regions in the feature map.
Figure 4. Overview of the attention structure in SiamPT, involving Multi-Head Attention (MHA), the Feed-Forward Network (FFN), the Feature Division Module (FDM), and the Prompter Generation Module (PGM).
Figure 5. The prompter generation process: local attention and global attention guide each other to form a prompter. As in Figure 3, different colors denote different clustering regions in the feature map.
Figure 6. Overall comparison on the UAV123 dataset, showing the success rate and precision of our proposed SiamPT.
Figure 7. Attribute-based comparison on the UAV123 dataset: Camera Motion, Aspect Ratio Change, and Full Occlusion.
Figure 8. Attribute-based comparison on the UAV123 dataset: Scale Variation, Viewpoint Change, and Low Resolution.
Figure 9. Attribute-based comparison on the UAV123 dataset: Illumination Variation, Fast Motion, and Out-of-View. On these attributes, our model is not in the leading position.
Figure 10. Visualization results on the UAV123 dataset.
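For orientation, the three-stage layout described in Figure 2 (CNN backbone, Transformer neck hosting FDM and PGM, tracker head) can be summarized in a schematic PyTorch sketch. The module bodies below are hypothetical placeholders for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class SiamPTSkeleton(nn.Module):
    """Schematic three-stage layout from Figure 2 (placeholder modules)."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone   # e.g., a ConvNeXt feature extractor
        self.neck = neck           # Transformer neck hosting FDM and PGM
        self.head = head           # double-head classification/regression layer

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        z = self.backbone(template)   # template-branch features
        x = self.backbone(search)     # search-branch features
        fused = self.neck(z, x)       # self-prompting fusion in the neck
        return self.head(fused)       # classification and box regression maps
```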
Abstract
1. Introduction
- We propose a fast and efficient UAV tracking framework with a self-prompting algorithm, effectively striking a balance between tracking speed and accuracy. To the best of our knowledge, this represents the first effort to define a prompter exclusively based on single-branch features.
- We introduce an innovative division strategy that distinguishes the source of the prompter from the source of the prompted features: the global attention mechanism serves as the source of the prompter, while the local attention mechanism provides the prompted features. This introduces a novel paradigm for rapidly fusing local and global information (a schematic sketch follows this list).
- SiamPT undergoes comprehensive evaluation on well-established UAV tracking benchmarks, UAV123 and UAV20L, and we further validate its performance on the general-purpose GOT-10k dataset. SiamPT not only achieves state-of-the-art results on these benchmarks but also delivers rapid inference.
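To make the division strategy concrete, the sketch below is our reading of the mechanism, with hypothetical shapes, names, and selection rule: a global attention pass scores all tokens, the most-attended tokens are lifted out as prompter tokens, and a local attention pass then attends to its own tokens plus these prompters.

```python
import torch
import torch.nn as nn

class SelfPromptingBlock(nn.Module):
    """Sketch: global attention yields prompter tokens conditioning local attention."""
    def __init__(self, dim=256, heads=8, n_prompts=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n_prompts = n_prompts

    def forward(self, local_tokens, all_tokens):
        # Global pass over all tokens; attention weights score token importance.
        g_out, g_w = self.global_attn(all_tokens, all_tokens, all_tokens,
                                      need_weights=True)
        # Pick the most-attended tokens as prompters (hypothetical selection rule).
        scores = g_w.mean(dim=1)                        # (B, N): avg attention received
        idx = scores.topk(self.n_prompts, dim=-1).indices
        prompts = torch.gather(
            g_out, 1, idx.unsqueeze(-1).expand(-1, -1, g_out.size(-1)))
        # Local tokens attend to [themselves + prompters]: the "prompted" features.
        kv = torch.cat([local_tokens, prompts], dim=1)
        out, _ = self.local_attn(local_tokens, kv, kv)
        return out
```

For example, `SelfPromptingBlock()(torch.randn(2, 64, 256), torch.randn(2, 128, 256))` returns prompted local features of shape (2, 64, 256).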
2. Related Works
3. Preliminary
4. Methods
4.1. CNN-Based Backbone
4.2. Self-Prompting Realization
4.2.1. Encoder-Decoder Pipeline
4.2.2. Feature Division Module (FDM)
4.2.3. Prompter Generation Module (PGM)
4.2.4. Prompter Integration Strategy (PIS)
4.3. Double Head Layer
5. Experimental Results and Analysis
5.1. State-of-the-Art Comparison
5.1.1. UAV123 Benchmark
5.1.2. UAV20L Benchmark
5.1.3. GOT-10k Benchmark
5.1.4. RPN Benchmarks
5.2. Visualization
5.3. Limitations and Future Work
5.4. Ablation Study
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Choi, J.; Yeum, C.M.; Dyke, S.J.; Jahanshahi, M.R. Computer-aided approach for rapid post-event visual evaluation of a building façade. Sensors 2018, 18, 3017. [Google Scholar] [CrossRef] [PubMed]
- Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
- Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556. [Google Scholar]
- Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar]
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 771–787. [Google Scholar]
- Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277. [Google Scholar]
- Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
- Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
- Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6728–6737. [Google Scholar]
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
- Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
- Wang, Z.; Yao, J.; Tang, C.; Zhang, J.; Bao, Q.; Peng, Z. Information-diffused graph tracking with linear complexity. Pattern Recognit. 2023, 143, 109809. [Google Scholar] [CrossRef]
- Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. AiATrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 146–164. [Google Scholar]
- Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
- Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9516–9526. [Google Scholar]
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 558–567. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
- Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight Aware Enhancement Transformer and Multiple Matching Network for Real-Time UAV Tracking. Remote Sens. 2023, 15, 2857. [Google Scholar] [CrossRef]
- Li, S.; Fu, C.; Lu, K.; Zuo, H.; Li, Y.; Feng, C. Boosting UAV tracking with voxel-based trajectory-aware pre-training. IEEE Robot. Autom. Lett. 2023, 8, 1133–1140. [Google Scholar] [CrossRef]
- Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. STMTrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783. [Google Scholar]
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
- Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552. [Google Scholar]
- Martin, D.; Goutam, B.; Gladh, S.; Khan, F.S.; Felsberg, M. Deep motion and appearance cues for visual tracking. Pattern Recognit. Lett. 2019, 124, 74–81. [Google Scholar] [CrossRef]
- Danelljan, M.; Gool, L.V.; Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7183–7192. [Google Scholar]
- Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 784–799. [Google Scholar]
- Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13444–13454. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580. [Google Scholar]
- Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8731–8740. [Google Scholar]
- Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357. [Google Scholar]
- Fu, Z.; Fu, Z.; Liu, Q.; Cai, W.; Wang, Y. SparseTT: Visual Tracking with Sparse Transformers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 905–912. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
- Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
- Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461. [Google Scholar]
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
- Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3086–3092. [Google Scholar]
| Metric | SiamSTM | SiamITL | ParallelTracker | Ours |
|---|---|---|---|---|
| Success Rate | 0.618 | 0.625 | 0.692 | 0.694 |
| Precision | 0.809 | 0.818 | 0.905 | 0.890 |
| Inference Speed (FPS) | 193 | 32 | 25 | 91 |
| Platform (GPU) | RTX 3090 | RTX 3090 | RTX 2070 | RTX 3090 |
| Parameters (MB) | 31.1 | 65.4 | 47.6 | 32.8 |
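For context on the FPS column, inference speed is commonly estimated by timing repeated forward passes on the GPU after a warm-up phase. Below is a minimal, generic PyTorch sketch; the single-input `tracker` module is a hypothetical stand-in (a real Siamese tracker would take a template and a search region).

```python
import time
import torch

def measure_fps(tracker, search_size=256, n_warmup=20, n_iters=200, device="cuda"):
    """Estimate inference FPS by timing repeated forward passes."""
    tracker = tracker.to(device).eval()
    x = torch.randn(1, 3, search_size, search_size, device=device)
    with torch.no_grad():
        for _ in range(n_warmup):       # warm-up: stabilize clocks and cuDNN autotuning
            tracker(x)
        torch.cuda.synchronize()        # start timing only after pending kernels finish
        start = time.perf_counter()
        for _ in range(n_iters):
            tracker(x)
        torch.cuda.synchronize()        # wait for all kernels before stopping the clock
    return n_iters / (time.perf_counter() - start)
```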
| Metric | SiamITL | SiamFC++ | SiamBAN | SiamAPN++ | SiamCAR | SESiamFC | Ours |
|---|---|---|---|---|---|---|---|
| Success | 0.588 | 0.575 | 0.564 | 0.533 | 0.523 | 0.453 | 0.653 |
| Precision | 0.769 | 0.742 | 0.736 | 0.703 | 0.687 | 0.648 | 0.848 |
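The success and precision scores above follow the standard one-pass evaluation protocol: success is the area under the IoU success plot, and precision is the fraction of frames whose center location error is within 20 pixels. A minimal NumPy sketch of both metrics, assuming per-frame predicted and ground-truth boxes in (x, y, w, h) format:

```python
import numpy as np

def iou(pred, gt):
    """Per-frame IoU for boxes in (x, y, w, h) format; arrays of shape (N, 4)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Success score: mean fraction of frames with IoU above each threshold (AUC)."""
    overlaps = iou(pred, gt)
    return np.mean([(overlaps > t).mean() for t in thresholds])

def precision_at(pred, gt, dist=20.0):
    """Precision: fraction of frames whose center error is within `dist` pixels."""
    c_pred = pred[:, :2] + pred[:, 2:] / 2
    c_gt = gt[:, :2] + gt[:, 2:] / 2
    err = np.linalg.norm(c_pred - c_gt, axis=1)
    return (err <= dist).mean()
```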
| Metric | SparseTT | TransT | DTT | TrDiMP | DiMP | SiamR-CNN | Ours |
|---|---|---|---|---|---|---|---|
| AO | 0.693 | 0.671 | 0.634 | 0.671 | 0.611 | 0.649 | 0.725 |
| SR0.5 | 0.791 | 0.768 | 0.749 | 0.777 | 0.717 | 0.738 | 0.827 |
| SR0.75 | 0.638 | 0.609 | 0.514 | 0.583 | 0.492 | 0.597 | 0.670 |
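GOT-10k reports average overlap (AO), the mean per-frame IoU, and the success rates SR0.5 and SR0.75, the fractions of frames whose IoU exceeds 0.5 and 0.75, respectively. A minimal sketch, reusing the `iou` helper from the previous snippet (per-sequence scores are then averaged over the benchmark):

```python
def got10k_metrics(pred, gt):
    """AO and SR at IoU thresholds 0.5 / 0.75 for one sequence."""
    overlaps = iou(pred, gt)
    return {
        "AO": overlaps.mean(),
        "SR0.5": (overlaps > 0.5).mean(),
        "SR0.75": (overlaps > 0.75).mean(),
    }
```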
| Metric | SiamRPN | SiamRPN++ | SiamBAN | SiamCAR | SiamSTM | Ours |
|---|---|---|---|---|---|---|
| Success | 0.557 | 0.610 | 0.631 | 0.614 | 0.647 | 0.694 |
| Precision | 0.710 | 0.752 | 0.833 | 0.760 | — | 0.890 |
| No. | PGM | FDM | Overall (SR) | Inference Speed (FPS) | Neck Model Size (MB) |
|---|---|---|---|---|---|
| 1 | × | × | 0.680 | 122.1 | 6.3 |
| 2 | × | ✓ | 0.677 | 98.2 | – |
| 3 | ✓ | × | 0.684 | 118.0 | – |
| 4 | ✓ | ✓ | 0.694 | 91.0 | 6.8 |