Macaron Attention: The Local Squeezing Global Attention Mechanism in Tracking Tasks
Figure 1. The overview diagram of the dual-stream tracking pipeline integrated with Macaron Attention. It comprises a tracking backbone, a Macaron tracking neck, and a tracking head. Macaron Attention is specifically designed to address scale variation and the limited perspective of UAV targets, using local squeezing global attention to tackle these issues.

Figure 2. The overview of fixed window attention (FWA). It partitions the feature tokens into non-overlapping small patches and applies attention within each local region. The red rectangle denotes the non-overlapping division, the circle with the black rectangle denotes the tokenized features, and the red dash denotes the combination of tokenized features.

Figure 3. The overview of LSGA. Local attention and global attention are directly combined to realize efficient information interaction. It includes the cluster-finding block, the local–global squeezing block, and the resetting block. The circle with the rectangle denotes tokens, as in Swin and CSwin.

Figure 4. The overview of conventional global attention. Only global attention is taken into account to enrich the global information; the query, key, and value each span the whole feature token.

Figure 5. The overall performance comparison of our tracker in success and precision.

Figure 6. Precision and success on the UAV123 dataset under similar object, partial occlusion, background clutter, and viewpoint change.

Figure 7. Precision and success on the UAV123 dataset under out-of-view, illumination variation, low resolution, and camera motion.

Figure 8. Precision and success under fast motion, scale variation, full occlusion, and aspect change.

Figure 9. Qualitative analysis via visualization: a comparison of bounding boxes among our tracker, TransT, TrDiMP, SiamCAR, and ATOM.
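The fixed window attention described above partitions the feature tokens into non-overlapping windows before attending within each one. A minimal sketch of that partition step, using hypothetical reshape-based helpers of our own naming (not the authors' implementation):

```python
import numpy as np

def window_partition(x, window):
    """Split an (N, C) token sequence into non-overlapping windows.

    Illustrative helper for the non-overlapping division used by
    fixed window attention; names and layout are assumptions.
    """
    n, c = x.shape
    assert n % window == 0, "token count must divide evenly into windows"
    return x.reshape(n // window, window, c)  # (num_windows, window, C)

def window_reverse(w):
    """Recombine windowed tokens back into the flat token sequence."""
    num_windows, window, c = w.shape
    return w.reshape(num_windows * window, c)

tokens = np.arange(24, dtype=float).reshape(12, 2)  # 12 tokens, C = 2
windows = window_partition(tokens, window=4)        # 3 windows of 4 tokens
restored = window_reverse(windows)                  # lossless round trip
```

Because the windows are non-overlapping, the partition and its inverse are exact reshapes with no copying or padding, which is what keeps window-local attention cheap.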
Abstract
1. Introduction
- We introduce Macaron Attention into a tracking pipeline named MATrack, effectively addressing the challenges posed by the limited UAV perspective and drastic changes in object scale.
- We incorporate the Macaron Attention mechanism into the tracking neck, integrating fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA). The "window-to-window" alignment strategy accounts for both global and local interactions, as well as scale changes.
- We conduct a comprehensive evaluation of our tracking pipeline on UAV123 and UAV20L, with GOT-10k additionally used for general scenarios. The results show that our pipeline achieves SOTA performance at an acceptable inference speed.
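The composition of local and global attention listed above can be sketched in simplified form. The following NumPy illustration stacks window-local attention (FWA) with conventional global attention; the LSGA squeezing and cluster-finding steps are omitted for brevity, and all function names here are our own assumptions, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the last two axes
    d = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d)) @ v

def fixed_window_attention(x, window):
    # x: (N, C) tokens; attend only within non-overlapping windows
    n, c = x.shape
    assert n % window == 0
    w = x.reshape(n // window, window, c)   # (num_windows, window, C)
    out = attention(w, w, w)                # per-window self-attention
    return out.reshape(n, c)

def global_attention(x):
    # conventional global attention: every token attends to all tokens
    return attention(x, x, x)

def macaron_neck(x, window=4):
    # hypothetical residual composition: local (FWA) then global (CGA);
    # the paper's LSGA stage would sit between these two
    x = x + fixed_window_attention(x, window)  # local interaction
    x = x + global_attention(x)                # global interaction
    return x

x = np.random.default_rng(0).normal(size=(16, 8))  # 16 tokens, C = 8
y = macaron_neck(x, window=4)
```

A useful property to note: in the window-local stage, perturbing tokens in one window cannot affect the outputs of any other window, which is why a global stage is still needed for cross-window interaction.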
2. Related Works
3. Method
3.1. Overview
3.2. Tracking Backbone
3.3. Macaron Attention Realization
3.3.1. Fixed Window Attention
3.3.2. Local Squeezing Global Attention
3.3.3. Cluster-Finding Block
3.3.4. Local–Global Squeezing Block
3.3.5. Resetting Block
3.3.6. Conventional Global Attention
3.4. Tracking Head
4. Experiments
4.1. Tracker Comparison
4.1.1. UAV123 Benchmark
4.1.2. UAV123 Benchmark
4.1.3. UAV20L Benchmark
4.1.4. GOT-10k Benchmark
4.2. Visualization
4.3. Limitation and Future Expectancy
4.4. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556. [Google Scholar]
- Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
- Xu, Q.; Deng, H.; Zhang, Z.; Liu, Y.; Ruan, X.; Liu, G. A ConvNeXt-based and feature enhancement anchor-free Siamese network for visual tracking. Electronics 2022, 11, 2381. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
- Wang, Z.; Yao, J.; Tang, C.; Zhang, J.; Bao, Q.; Peng, Z. Information-diffused graph tracking with linear complexity. Pattern Recognit. 2023, 143, 109809. [Google Scholar] [CrossRef]
- Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600. [Google Scholar]
- Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight Aware Enhancement Transformer and Multiple Matching Network for Real-Time UAV Tracking. Remote Sens. 2023, 15, 2857. [Google Scholar] [CrossRef]
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
- Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 146–164. [Google Scholar]
- Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6668–6677. [Google Scholar]
- Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6269–6277. [Google Scholar]
- Zheng, Z.; Wan, Y.; Zhang, Y.; Xiang, S.; Peng, D.; Zhang, B. CLNet: Cross-layer convolutional neural network for change detection in optical remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 247–267. [Google Scholar] [CrossRef]
- Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
- Chen, B.; Li, P.; Bai, L.; Qiao, L.; Shen, Q.; Li, B.; Gan, W.; Wu, W.; Ouyang, W. Backbone is all your need: A simplified architecture for visual object tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 375–392. [Google Scholar]
- Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 771–787. [Google Scholar]
- Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 13444–13454. [Google Scholar]
- Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
- Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
- Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8731–8740. [Google Scholar]
- Fu, Z.; Fu, Z.; Liu, Q.; Cai, W.; Wang, Y. SparseTT: Visual Tracking with Sparse Transformers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 905–912. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357. [Google Scholar]
- Wang, Z.; Zhou, G.; Yao, J.; Zhang, J.; Bao, Q.; Hu, Q. Self-Prompting Tracking: A Fast and Efficient Tracking Pipeline for UAV Videos. Remote Sens. 2024, 16, 748. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and better learning for bounding box regression. arXiv 2020, arXiv:1911.08287. [Google Scholar] [CrossRef]
- Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461. [Google Scholar]
- Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
- Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
- Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6728–6737. [Google Scholar]
- Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580. [Google Scholar]
| Metric | SiamITL | SiamEMT | ParallelTracker | SiamPT | Ours |
|---|---|---|---|---|---|
| Success | 62.5 | 62.7 | 69.2 | 69.4 | 71.0 |
| Precision | 81.8 | 81.9 | 90.5 | 89.0 | 91.1 |
| Inference Speed (FPS) | 193 | 25 | 30 | 91 | 25 |
| Platform | RTX3090 | RTX3090 | RTX2070 | RTX3090 | RTX3090 |
| Parameters | 65.4 M | 71.2 M | 47.6 M | 32.8 M | 22.3 M |
| Metric | SGDViT | SiamRPN++ | SiamAPN | SiamAPN++ | SiamPT | Ours |
|---|---|---|---|---|---|---|
| Success | 51.9 | 57.9 | 51.8 | 53.3 | 65.3 | 69.4 |
| Precision | 69.2 | 75.8 | 69.2 | 70.3 | 84.8 | 89.1 |
| Metric | AutoMatch | SBT | SLT | STARK | TransT | OSTrack | SiamPT | Ours |
|---|---|---|---|---|---|---|---|---|
| AO | 65.2 | 70.4 | 67.5 | 68.8 | 67.1 | 71.0 | 72.5 | 71.8 |
| SR0.5 | 76.6 | 80.8 | 76.5 | 78.1 | 76.8 | 80.4 | 82.7 | 81.7 |
| SR0.75 | 54.3 | 64.7 | 60.3 | 64.1 | 60.9 | 68.2 | 67.0 | 69.2 |
| No. | FWA | LSGA | – | – | – | SR | Inference Speed (FPS) |
|---|---|---|---|---|---|---|---|
| 1 | ✓ | ✕ | ✕ | ✓ | ✕ | 68.2 | 49.8 |
| 2 | ✕ | ✓ | ✕ | ✓ | ✕ | 69.4 | 27.9 |
| 3 | ✓ | ✓ | ✓ | ✕ | ✕ | 70.2 | 25.2 |
| 4 | ✓ | ✓ | ✕ | ✓ | ✕ | 71.0 | 23.1 |
| 5 | ✓ | ✓ | ✕ | ✕ | ✓ | 70.3 | 21.0 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, Z.; Luo, H.; Liu, D.; Li, M.; Liu, Y.; Bao, Q.; Zhang, J. Macaron Attention: The Local Squeezing Global Attention Mechanism in Tracking Tasks. Remote Sens. 2024, 16, 2896. https://doi.org/10.3390/rs16162896