Article

Macaron Attention: The Local Squeezing Global Attention Mechanism in Tracking Tasks

1 Institute of Optics and Electronics, Chinese Academy of Sciences, Beijing 100045, China
2 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
3 Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Beijing 100045, China
4 School of Electronic, Electrical and Communication Engineering, Chinese Academy of Sciences, Beijing 100049, China
5 National Key Laboratory of Optical Field Manipulation Science and Technology, Chinese Academy of Sciences, Beijing 100045, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2896; https://doi.org/10.3390/rs16162896
Submission received: 16 April 2024 / Revised: 31 July 2024 / Accepted: 31 July 2024 / Published: 8 August 2024

Abstract

The Unmanned Aerial Vehicle (UAV) tracking task finds extensive utility across various applications. However, current Transformer-based trackers are generally tailored for diverse scenarios and lack specific designs for UAV applications. Moreover, due to the complexity of training in tracking tasks, existing models strive to improve tracking performance within limited scales, making it challenging to directly apply lightweight designs. To address these challenges, we introduce an efficient attention mechanism known as Macaron Attention, which we integrate into the existing UAV tracking framework to enhance the model’s discriminative ability within these constraints. Specifically, our attention mechanism comprises three components: fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA), collectively forming a Macaron-style attention implementation. Firstly, the FWA module addresses the multi-scale issue in UAVs by cropping tokens within a fixed window scale in the spatial domain. Secondly, in LSGA, to adapt to scale variation, we employ an adaptive clustering-based token aggregation strategy and design a “window-to-window” fusion attention model to integrate global attention with local attention. Finally, the CGA module is applied to prevent matrix rank collapse and improve tracking performance. By combining the FWA, LSGA, and CGA modules, we propose a brand-new tracking model named MATrack. On the UAV123 benchmark, the primary evaluation dataset, MATrack achieves 0.710 success and 0.911 precision.

1. Introduction

Visual object tracking (VOT) in remote sensing involves identifying and predicting the positions of objects within a video sequence, particularly those associated with Unmanned Aerial Vehicles (UAVs). The task entails predicting the position of an object using a bounding rectangle, based on its initial appearance and location. In the UAV field, visual target tracking specifically involves extracting the features of moving targets from a low-altitude aerial perspective in video footage captured by a camera mounted on a drone. Unlike other conventional tracking scenarios, the changes in box size and aspect ratio relative to the initial frame are particularly significant from a low-altitude perspective. In these scenarios, the tracking target can be another moving UAV or another object in a bird’s-eye view. Moreover, unlike detection tasks, UAV tracking tasks rely only on the ground truth of the target in the initial frame of a video and predict the position and size of the moving target in subsequent frames based on the extracted target features. This process is target agnostic, as the initial target can be any given object. This capability finds broad application in military reconnaissance, marine navigation, and autonomous vehicles. Despite significant progress in addressing challenges such as object-scale variations and occlusion, current trackers still encounter difficulties with UAV objects. Therefore, the necessity of designing specialized networks to effectively address UAV-specific challenges warrants careful consideration.
To address the aforementioned challenges, deep learning methods have been extensively utilized in visual object tracking (VOT). Classic VOT algorithms can be broadly categorized into two paradigms: Siamese trackers and Vision Transformer (ViT) trackers. Siamese trackers, in particular, can be further divided into dual-stream and single-stream trackers. Dual-stream trackers typically consist of three main components: backbones, necks, and heads. The backbone’s primary function is to extract initial features from both branches. A variety of backbone architectures have been employed in these trackers, ranging from traditional ones like AlexNet [1] in SiamFC [2], ResNet [3] in the SiamRPN series [4,5], and GoogleNet [6] in SiamFC++ [7], to more recent ones such as Swin Transformer in Swintrack [8], ConvNeXt [9] in CAFSN [10], and EfficientNet [11] in IDGtrack [12]. The performance of the backbone serves as a crucial determinant of overall tracker performance, as it lays the foundation for feature extraction and representation.
Regarding the tracking neck, it plays a crucial role in facilitating information fusion from the two branches within the Siamese architecture. Various interaction methods have emerged to achieve more effective information exchange. Traditional methods [7,13,14] primarily rely on Depth-wise Cross Correlation (DW-Cor), which employs depth-wise convolution to compute correlation results channel by channel. Contrarily, recent Transformer-based methods [12,15,16] replace DW-Cor with attention mechanisms. This transition broadens the model’s information interaction capability to distinguish actual objects from the background, consequently enhancing overall tracking performance. As for the tracking head, its objective is to classify positive and negative samples and estimate their positions in rectangle forms. Anchor designs are also incorporated in this component. For example, SiamBAN [17] and SiamCAR [18] directly predict the distance difference between the estimated positions and the given bounding boxes to improve localization accuracy. CLNet [19] employs a latent encoder to derive a compact feature representation based on the statistical information. These dual-stream approaches often assign interactive capabilities to the neck layer, giving the network flexible expansion capabilities.
Recently, there has been a surge of interest in single-stream trackers, largely due to their streamlined integrated structure and the superior performance of Vision Transformers (ViTs). For instance, STARK, MixFormer, and SimTrack [15,20,21] have emerged as notable examples in this domain. These trackers avoid the dual-stream correlation operation in the tracking neck to form a more concise structure. Notably, STARK [20] concatenates the image inputs from both branches before sending them into the Transformer backbone, facilitating information interaction at every Transformer layer. MixFormer [22] introduces the Mixed Attention Module to modify the plain ViT as its single-stream backbone, while SimTrack [21] unifies the dual-stream tracking networks in a ViT. Despite the considerable advancements in leveraging attention mechanisms, current trackers still face limitations in effectively handling specialized objects such as UAVs.
From our perspective, two crucial factors should be emphasized when dealing with UAV objects: (i) the limited UAV perspective and motion blur, and (ii) drastic changes in object scale. As a result, we opt for dual-stream trackers due to their flexibility and primarily focus on modifying the tracking neck to accommodate these factors. To deal with the above factors, we propose an attention realization called Macaron Attention, integrated into the current dual-branch tracking structure (MATrack). MATrack hierarchically considers the scale changes of UAVs and enhances the overall tracking performance. Specifically, the Macaron Attention mechanism integrates fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA). The FWA module tackles the multi-scale issue in UAVs by cropping tokens within a fixed window scale in the spatial domain. The LSGA module, on the other hand, adapts to scale variation by employing an adaptive clustering-based token aggregation strategy and a “window-to-window” alignment strategy to integrate global attention with local attention. Finally, the CGA module is applied to keep the Transformer’s original global modeling capability. Notably, our major contribution lies in proposing the Macaron Attention mechanism, which enhances feature representation. This innovation can potentially extend our algorithm’s application to other fields involving attention mechanisms. Its key advantage is the ability to address multi-scale issues in a clustering-based manner, making it a general solution for target-related tasks such as classification, object detection, and keypoint estimation.
We validate the efficacy of Macaron Attention on the basic UAV123 dataset and the long-term UAV20L dataset, as well as on GOT-10K for general tracking scenarios. The results show that our pipeline achieves state-of-the-art (SOTA) tracking performance at 23 FPS, which offers the potential for deployment on real tracking platforms.
Here, we demonstrate our contributions:
  • We introduce Macaron Attention into the tracking pipeline named MATrack, effectively addressing the challenges posed by the limited UAV perspective and drastic changes in object scale.
  • We incorporate the Macaron Attention mechanism into the tracking neck, integrating fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA). The “window-to-window” alignment strategy considers both global and local interactions, as well as scale changes.
  • Comprehensive evaluation is conducted on our tracking pipeline on UAV123 and UAV20L. Additionally, GOT-10K is also used for the general scenarios. It proves that our pipeline achieves SOTA performance with acceptable inference speed.

2. Related Works

Dual-stream trackers. The Siamese tracking pipelines are designed to treat tracking tasks as similarity-matching problems, employing a dual-branch structure comprising template and search regions. The template regions primarily focus on capturing the intrinsic characteristics of the object, while the search regions aim to discriminate between the object and the background. Early works like SiamFC [2], SiamRPN [5], and SiamRPN++ [4] laid the groundwork for the Siamese tracking pipeline, which typically consists of the tracking backbone, tracking neck, and tracking head. In SiamFC [2], a correlation-based tracking neck is introduced to define the tracking task and incorporate multi-scale metrics. SiamRPN [5] integrates region proposal networks (RPNs) from Faster R-CNN [23] into the tracking head to estimate the object’s center point using a rectangle defined by its width and height. SiamRPN++ [4] further enhances the approach by incorporating advanced backbones like ResNet and implementing a novel sampling strategy to prevent points from clustering excessively around the center. Moreover, DaSiamRPN [24] addresses the issue of distractor interference by introducing learnable distractor-aware features to suppress distractors effectively. Anchor-free tracking pipelines, such as Ocean, SiamCAR, and SiamFC++ [7,18,25], eliminate the predefined anchor boxes when sampling positive points and instead directly estimate the offsets between the predicted boxes and the object’s given rectangle relative to the center point.
Expanding on the extensive information interaction of Transformers, Siamese trackers have recently integrated Transformer-based models due to their impressive performance. For instance, TransT [15] introduced MHA (multi-head attention) blocks and MCA (multi-head cross attention) blocks, significantly enhancing the interaction ability between the two branches. KeepTrack [26] introduced a learned association network to tackle similarity inference. Meanwhile, online trackers such as the Discriminative Correlation Filter (DCF) [27], Kernelized Correlation Filter (KCF) [28], ATOM [29], and DiMP [30] employ dynamic modeling techniques to adaptively learn object characteristics through continuous online learning. To leverage temporal information effectively, ToMP [31] integrates a Transformer encoder–decoder module with DCF to grasp temporal divergences dynamically. SparseTT [32] applies the Top-K algorithm to enhance discrimination ability, which reduces model complexity and increases inference speed. Nevertheless, current Transformer trackers lack a dedicated design for UAV scenarios that addresses the two crucial factors mentioned in the Introduction.
Single-stream trackers. Single-stream trackers make the tracking pipeline more concise by introducing a unified backbone to replace the separate tracking backbone and neck in dual-stream trackers. This approach treats the template and search regions as a unified input, concatenating them before inputting them into the backbone. STARK [20] adopts a similar structure to DETR [33], concatenating template and search regions before the DETR encoder. It also incorporates a random sequence to predict the bounding box before the DETR decoder. OSTrack [34] aims to sparsify the ViT structure by retaining 70% of the sub-patches in a given training epoch to enhance inference speed. SimTrack [21] pioneers the integration of ViT into visual tracking and discards the tracking heads. MixFormer [22] proposes a modified ViT using the Mixed Attention Module to form the single-stream backbone. However, single-stream trackers often entail higher computational complexity due to the attention mechanism, and their designs have not been tailored to the UAV dataset.
We begin by providing an overview of MATrack. Then, we introduce the details of the tracking framework into which our Macaron Attention is integrated, comprising three main modules: the feature extraction backbone, the Macaron tracking neck, and the tracking head. Within the Macaron tracking neck, we elaborate on the three components of Macaron Attention: fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA).

3. Method

3.1. Overview

We propose a novel attention realization, termed Macaron Attention, integrated into the tracker MATrack based on a dual-branch framework as illustrated in Figure 1. Here, C denotes the channel dimension, H represents the height dimension, and W indicates the width dimension. Additionally, Q, K, and V refer to the query, key, and value components within the attention mechanism. Our structure consists of two streams: the upper stream accepts the template image as input, while the lower stream processes the search image. Both the template and search images are extracted from the same video sequence. The objective is to use the template image to match with the search images and identify the target with the highest correlation. Initially, given the dual branches’ input frames, we employ an efficient CNN-based backbone for feature extraction using the tiny EfficientNet-V2 architecture without MAE pretraining. Once the backbone features are obtained, we introduce Macaron Attention, wherein fixed window attention (FWA) is first applied to achieve linear attention realization in the form of local regions. FWA primarily addresses scale changes in small samples, particularly for UAV objects. Subsequently, the local features are restored to their original scale size and fed into the local squeezing global attention (LSGA), where global features and adaptively aggregated local features are fused at the scale level. Additionally, conventional global attention (CGA) is employed to maintain the Transformer’s original global modeling capability. Finally, the MLP head is utilized to predict the center point with the rectangle’s height H and width W.
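To make this data flow concrete, the sketch below outlines the dual-branch pipeline at a structural level. The class and attribute names (MATrackSketch, backbone, neck, head) and the exact tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MATrackSketch(nn.Module):
    """Structural sketch of the dual-branch pipeline: backbone -> Macaron neck -> head."""
    def __init__(self, backbone, neck, head):
        super().__init__()
        self.backbone = backbone  # shared CNN backbone (e.g., a truncated EfficientNet-V2)
        self.neck = neck          # Macaron Attention neck (FWA -> LSGA -> CGA)
        self.head = head          # MLP head predicting the classification map and box size

    def forward(self, template, search):
        # template: (B, 3, 128, 128), search: (B, 3, 256, 256)
        f_z = self.backbone(template)    # template features
        f_x = self.backbone(search)      # search features
        f_z, f_x = self.neck(f_z, f_x)   # local-to-global attention fusion across branches
        return self.head(f_x)            # center point, width, and height of the target
```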

3.2. Tracking Backbone

To realize a UAV-specific tracking network, a dual-stream tracker structure is utilized. The template input $T$ and search input $S$ are resized into $T \in \mathbb{R}^{3 \times H_t \times W_t}$ and $S \in \mathbb{R}^{3 \times H_s \times W_s}$, respectively, within a dual-branch structure. The default dimensions are set to $H_t = 128$ and $H_s = 256$. Subsequently, both $T$ and $S$ are input into a dual-branch backbone. Our choice of tracking backbone is EfficientNet-V2 with its last two layers removed, which consistently demonstrates superior performance with lower model complexity compared to recent popular alternatives.
There are five blocks in our modified EfficientNet-V2. Starting with an input size of $\mathbb{R}^{C \times H \times W}$ in the first block ($C$ is the channel dimension, $H$ the height, and $W$ the width), the output feature dimensions of the subsequent blocks are $\mathbb{R}^{2C \times \frac{H}{2} \times \frac{W}{2}}$, $\mathbb{R}^{\frac{8}{3}C \times \frac{H}{4} \times \frac{W}{4}}$, $\mathbb{R}^{\frac{16}{3}C \times \frac{H}{4} \times \frac{W}{4}}$, and $\mathbb{R}^{\frac{20}{3}C \times \frac{H}{4} \times \frac{W}{4}}$, respectively. Subsequently, in conjunction with SiamPT [35], we introduce a Cross-Block Aggregation (CBA) module to adaptively fuse the outputs from the last two blocks. To merge these representations, a 2D convolution is applied to unify the channel dimensions from $\mathbb{R}^{\frac{16}{3}C \times \frac{H}{4} \times \frac{W}{4}}$ and $\mathbb{R}^{\frac{20}{3}C \times \frac{H}{4} \times \frac{W}{4}}$ into $\mathbb{R}^{\bar{C} \times \frac{H}{4} \times \frac{W}{4}}$ (the default value of $\bar{C}$ is 256). Following this, an adaptive fusion operation (a weighted sum with learnable parameters $\alpha$ and $\beta$) is employed to merge the aforementioned features from the last two blocks.
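The Cross-Block Aggregation step could be realized roughly as follows; the module name CBA, the 1 × 1 convolutions used to unify channels, and the initial values of the learnable weights α and β are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CBA(nn.Module):
    """Sketch of Cross-Block Aggregation: unify the channels of the last two
    backbone blocks with 2D convolutions, then fuse them by a learnable weighted sum."""
    def __init__(self, c4, c5, c_out=256):
        super().__init__()
        self.proj4 = nn.Conv2d(c4, c_out, kernel_size=1)  # unify block-4 channels to C_bar
        self.proj5 = nn.Conv2d(c5, c_out, kernel_size=1)  # unify block-5 channels to C_bar
        self.alpha = nn.Parameter(torch.tensor(0.5))      # learnable fusion weight (assumed init)
        self.beta = nn.Parameter(torch.tensor(0.5))       # learnable fusion weight (assumed init)

    def forward(self, feat4, feat5):
        # feat4: (B, c4, H/4, W/4), feat5: (B, c5, H/4, W/4); output: (B, c_out, H/4, W/4)
        return self.alpha * self.proj4(feat4) + self.beta * self.proj5(feat5)
```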

3.3. Macaron Attention Realization

In implementing the Macaron Attention mechanism, three components are employed, namely, fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA), all integrated within the Macaron tracking neck as depicted in Figure 2. After extracting features from the dual branches, represented by $\mathbb{R}^{\bar{C} \times \frac{H}{4} \times \frac{W}{4}}$, only the search-branch features pass through all three components, while the template-branch features undergo CGA twice. This strategy aims to reduce computational costs in the dual-branch Siamese structure.

3.3.1. Fixed Window Attention

As illustrated in Figure 2, to address the limited perspective of UAVs and the significant variations in object scale, we adopt a self-attention mechanism within fixed local windows. These red windows are arranged to divide the features in a non-overlapping manner. Assuming the initial dimension is $\mathbb{R}^{C \times H \times W}$ ($C$ is the input feature dimension) and the division scale is denoted as $L$, each window contains $\frac{H}{L} \times \frac{W}{L}$ patches ($H = W$), resulting in a dimension of $\mathbb{R}^{C \times \frac{H}{L} \times \frac{W}{L}}$. The fixed window attention (FWA) also involves computations among queries, keys, and values, which can be mathematically expressed as:
$$FWA(Q_L, K_L, V_L) = \mathrm{softmax}\!\left(\frac{Q_L K_L^T}{\sqrt{d_k}}\right) V_L \quad (1)$$
where $Q_L$, $K_L$, and $V_L$ are the query, key, and value after linear mapping in the local region, respectively, and $d_k$ denotes the channel dimension of the key. The default value of $L$ is set to 4. Notice that the computational cost of FWA is $O\!\left(\left(\frac{HW}{L}\right)^2 d\right)$, which is distinct from the quadratic complexity of global attention with respect to the number of tokens. Consequently, given the fixed size of $L$, such an attention implementation can be viewed as a linear attention realization. However, local window attention primarily focuses on capturing local information, posing a significant challenge to effectively extracting features globally. Moreover, various advanced models have addressed this limitation by introducing innovative techniques such as the shifting operation (e.g., Swin Transformer) or incorporating multi-scale local windows (e.g., CSwin Transformer). While these algorithms prove advantageous for designing the backbone, the challenge arises when addressing the neck layer. Considering that stacking numerous attention layers may introduce significant computational overhead, striking a balance between computational efficiency and tracking accuracy becomes paramount. Hence, we introduce a solution that integrates both local and global attention directly into the neck layer, employing a fixed number of layers $N$, with the default value set to 2.
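A minimal sketch of FWA is given below, assuming a square feature map divisible into an L × L grid of non-overlapping windows; the linear Q/K/V projections are omitted (identity) for brevity, so this illustrates the window partition and per-window softmax attention rather than the exact layer.

```python
import torch
import torch.nn.functional as F

def fixed_window_attention(x, L=4):
    """Sketch of FWA: split (B, C, H, W) features into an L x L grid of
    non-overlapping windows and run softmax attention inside each window."""
    B, C, H, W = x.shape
    h, w = H // L, W // L                      # window height/width (assumes H, W divisible by L)
    # (B, C, L, h, L, w) -> (B*L*L, h*w, C): one token sequence per window
    windows = (x.reshape(B, C, L, h, L, w)
                 .permute(0, 2, 4, 3, 5, 1)
                 .reshape(B * L * L, h * w, C))
    q = k = v = windows                        # identity Q/K/V projections for brevity
    attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ v                             # (B*L*L, h*w, C)
    # restore the original (B, C, H, W) layout
    return (out.reshape(B, L, L, h, w, C)
               .permute(0, 5, 1, 3, 2, 4)
               .reshape(B, C, H, W))
```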

3.3.2. Local Squeezing Global Attention

To facilitate information interaction between local and global perspectives, we introduce the local squeezing global attention (LSGA), comprising the cluster-finding block, local–global squeezing block, and resetting block. Unlike the backbone design in Swin and CSwin, which must maintain linearity throughout, the fixed layer count of Macaron Attention allows for a more efficient direct implementation of local–global interaction.

3.3.3. Cluster-Finding Block

The top left of Figure 3 demonstrates the cluster-finding block, where overlapped regions are cropped for the sequencing process. Although inspired by the Swin Transformer [36], the fixed window attention yields limited performance, as it is akin to a static process. Therefore, we introduce the cluster-finding block to enable a dynamic regional attention division. This block facilitates the division of attention into smaller token regions, thereby achieving local attention with dynamic variation across token regions, similar to a multi-scale approach like CSwin. Importantly, as the attention process occurs solely within each divided region, it remains a linear process, thus conserving computational costs within these regions.
Following SiamPT [35], to realize the cluster-finding block (CFB), an overlapped division strategy is applied. Two initial cluster centers are generated within this division process, initiated by the input tokens represented as $\mathbb{R}^{C \times HW}$. This is achieved by reshaping the input dimension into $\mathbb{R}^{C \times 2 \times \frac{HW}{2}}$ and then averaging over the $\frac{HW}{2}$ elements to obtain the two cluster centers, represented as $\mathbb{R}^{C \times 2}$. Subsequently, the attention mechanism, devoid of the Softmax operation, is employed to evaluate the relationships between these centers and the other tokens:
$$Relation(T_o, C_{ini}) = T_o \cdot \mathrm{Norm}\!\left(\frac{T_o^T C_{ini}}{\sqrt{d_k}}\right) \quad (2)$$
where $T_o$ denotes the input tokens ($T_o^T$ is the transposed token matrix), functioning as both the query and value, while $C_{ini}$ represents the cluster centers, serving as the key. $d_k$ is the dimension of the key, and $\mathrm{Norm}$ refers to the normalization operation. Its output, with the dimension of $\mathbb{R}^{HW \times 2}$, represents the distance between the other tokens and the two central tokens. By ranking these distance values, we can assign half of the tokens to one cluster center and the remaining half to the other center.
Furthermore, this process can be iterated $T$ times, with the number of cluster centers doubling at each iteration; at the $T$-th iteration, each sub-token is allocated to a division of size $HW/2^{T-1}$. Moreover, to enhance interaction among distinct cluster segments, we employ an overlapping strategy. In each iteration, the token division is conducted equitably, and critical position tokens in these equally divided regions are reused by adding them back into the following local attention calculation. Formula (3) for this process is outlined below:
$$ClusterAttMap(Q_c, K_c) = \mathrm{softmax}\!\left(\frac{Q_c [K_c; O_p]^T}{\sqrt{d_k}}\right) \quad (3)$$
where $Q_c$ and $[K_c; O_p]$ represent the query and key in each cluster region, respectively, and $O_p$ represents the overlapped tokens in the division region. $ClusterAttMap(Q_c, K_c)$ indicates that only the similarity maps of the attention are needed in the cluster-finding block. This process constructs the dynamic local region interaction in an overlapped manner.
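The following sketch illustrates one iteration of the cluster-finding idea under simplifying assumptions: the relation of Equation (2) is reduced to a normalized dot-product score, and the overlap-token reuse is omitted. The function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def cluster_find(tokens):
    """Sketch of one cluster-finding iteration: derive two centers by averaging an
    even split of the tokens, score every token against both centers, and assign
    half of the tokens to each center by ranking the scores."""
    # tokens: (C, N) with N = H*W flattened positions (N assumed even)
    C, N = tokens.shape
    centers = tokens.reshape(C, 2, N // 2).mean(dim=-1)       # (C, 2) initial cluster centers
    scores = tokens.t() @ centers / C ** 0.5                  # (N, 2) token-to-center relations
    scores = F.normalize(scores, dim=0)                       # stand-in for the Norm(.) operation
    # rank by preference for center 0: top half -> cluster 0, bottom half -> cluster 1
    order = torch.argsort(scores[:, 0] - scores[:, 1], descending=True)
    return order[: N // 2], order[N // 2:]                    # index sets of the two clusters
```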

3.3.4. Local–Global Squeezing Block

In the local–global squeezing block, depicted in the middle part of Figure 3, we incorporate the global attention mechanism, scattering operation, and Hadamard product. The primary challenge lies in effectively extracting meaningful information in different attention scales to address scale variation and limited UAV perspectives.
The primary challenge lies in formulating an efficient mechanism for extracting rich semantic details from global attention, thereby empowering the disparate cluster regions to effectively tackle the aforementioned obstacles. To elucidate this process, we commence by scattering the attention map of each cluster to match the dimensions of the global attention map. Specifically, given the input after $\mathrm{softmax}(Q_c [K_c; O_p]^T)$, we initially adjust its dimensions from $\mathbb{R}^{N \times \frac{HW}{N} \times (\frac{HW}{N} + O_p)}$ to $\mathbb{R}^{N \times \frac{HW}{N} \times \frac{HW}{N}}$, where $O_p$ represents the additional dimension introduced by the overlapping tokens. This step results in the generation of $N$ cluster centers, each capable of producing an attention map of identical dimensions. Subsequently, through the cropping operation, we create a rectangular scattered attention map of $\mathbb{R}^{HW \times HW}$, aligning its size with that of the global attention depicted in the bottom part of Figure 3.
Subsequently, each cluster region’s scattered attention map is padded with values by their respective indexes, while zero-padding is applied to positions lacking indexes. Then, a Hadamard product is computed between each scattered attention map and the global attention map. This operation illustrates how a unified global attention map can guide the disparate cluster regions, with distinct local attention values squeezing the global attention at various positions. This dynamic integration of the global perspective into local regions serves to enhance the tracking performance of UAV objects.
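A rough sketch of the scattering and Hadamard-product step is shown below; the list-based representation of cluster attention maps and index sets is an assumption about how the intermediate results could be stored.

```python
import torch

def local_global_squeeze(cluster_maps, cluster_idx, global_attn):
    """Sketch of local-global squeezing: scatter each cluster's local attention map
    to the global (HW x HW) size and modulate the global map by a Hadamard product."""
    # cluster_maps: list of (n_i, n_i) local attention maps (overlap columns already cropped)
    # cluster_idx:  list of (n_i,) index tensors giving each cluster's token positions
    # global_attn:  (HW, HW) conventional global attention map
    HW = global_attn.shape[0]
    squeezed = []
    for att, idx in zip(cluster_maps, cluster_idx):
        scattered = torch.zeros(HW, HW, dtype=att.dtype)     # zero-padding at positions without indexes
        scattered[idx.unsqueeze(1), idx.unsqueeze(0)] = att  # place local values at their positions
        squeezed.append(scattered * global_attn)             # Hadamard product with global attention
    return squeezed                                          # one squeezed (HW x HW) map per cluster
```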

3.3.5. Resetting Block

To reduce computation costs, we continue to employ a linear approach with a fixed down-sampling strategy. The local squeezing global attention (LSGA) primarily emphasizes local attention, and a resetting block is employed to reverse the scattering operation. This reversal involves converting the $N$ cluster regions, initially represented as $\mathbb{R}^{HW \times HW}$, back to the size of the local attention, $\mathbb{R}^{\frac{HW}{N} \times \frac{HW}{N}}$. Leveraging the indices generated by the cluster-finding block, these values are remapped to reconstruct the local attention map. Subsequently, the attention mechanism is completed by multiplying with the tokens from $[V_c; O_p]$:
$$ResetAtt(Q_c, K_c, V_c) = \mathrm{softmax}\!\left(\frac{\hat{Q}_c [\hat{K}_c; O_p]^T}{\sqrt{d_k}}\right)[V_c; O_p] \quad (4)$$
where $[V_c; O_p]$ represents the value, and $\hat{Q}_c$ and $\hat{K}_c$ denote the query and key after the reversing operation.
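The resetting step can then be sketched as the inverse gather followed by the value multiplication of Equation (4); the softmax re-normalization and the omission of the overlap tokens are simplifications for illustration.

```python
import torch

def reset_block(squeezed_map, idx, values):
    """Sketch of the resetting block: gather a squeezed (HW x HW) map back to the
    cluster's local size and finish attention by multiplying with the value tokens."""
    # squeezed_map: (HW, HW) output of the local-global squeezing step
    # idx:          (n,) token positions belonging to this cluster
    # values:       (n, C) value tokens of the cluster (overlap tokens omitted here)
    local_attn = squeezed_map[idx.unsqueeze(1), idx.unsqueeze(0)]  # back to (n, n)
    local_attn = torch.softmax(local_attn, dim=-1)                 # re-normalize the rows
    return local_attn @ values                                     # (n, C) attended tokens
```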

3.3.6. Conventional Global Attention

Significantly, to maintain tracking precision, conventional global attention (CGA) is employed. CGA comprises two-layer linear projections (MLPs), Rectified Linear Units (ReLUs), residual connections, and multi-head attention (MHA). Utilizing features from both the target and search branches ($f_z$, $f_x$), the detailed calculation in the template branch can be expressed as:
$$\begin{aligned} F_z &= \mathrm{Linear}_z(f_z) \\ F_x &= \mathrm{Linear}_x(f_x) \\ \hat{F}_z &= F_z + \mathrm{LN}(\mathrm{MHA}(F_x, F_z, P_z)) \\ \hat{F}_z &= \hat{F}_z + \mathrm{LN}(\mathrm{MLP}(\hat{F}_z)) \end{aligned} \quad (5)$$
where $\hat{F}_z$ represents the output of the template branch, $P_z$ is the position embedding, and $\mathrm{MHA}$ follows the softmax attention shown in FWA without region division. Figure 4 demonstrates the difference between the FWA and the CGA; the default number of heads in CGA is 8.
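A sketch of a single CGA block on the template branch, following Equation (5), is shown below. The use of nn.MultiheadAttention, the way the position embedding is added to the query, and the assignment of query/key roles reflect our reading of the equation and should be treated as assumptions.

```python
import torch
import torch.nn as nn

class CGABlock(nn.Module):
    """Sketch of a conventional global attention block on the template branch:
    linear projections, multi-head cross attention, LayerNorm residuals, and an MLP."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.linear_z = nn.Linear(dim, dim)
        self.linear_x = nn.Linear(dim, dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_z, f_x, pos_z):
        # f_z: (B, N_z, dim) template tokens, f_x: (B, N_x, dim) search tokens,
        # pos_z: (B, N_z, dim) position embedding of the template branch
        F_z, F_x = self.linear_z(f_z), self.linear_x(f_x)
        attn, _ = self.mha(query=F_z + pos_z, key=F_x, value=F_x)  # cross attention
        F_z = F_z + self.norm1(attn)                               # first residual with LN
        return F_z + self.norm2(self.mlp(F_z))                     # MLP residual with LN
```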

3.4. Tracking Head

The tracking head comprises two branches, classification and regression, each employing MLPs with three layers. The classification branch categorizes positive object samples, while the regression branch estimates the center point, width, and height. We apply a weighted binary cross-entropy loss for classification and utilize the $l_1$ loss and CIoU loss [37] for regression:
$$L_{cls} = -\sum_i \left( \lambda_i \log(y_i) + (1 - \lambda_i)\log(1 - y_i) \right) \quad (6)$$
$$L_{reg} = \lambda_{CIoU} L_{CIoU} + \lambda_{l_1} L_{l_1} \quad (7)$$
where $y_i$ represents the probability output derived from the tracking head, while $\lambda_{CIoU}$ and $\lambda_{l_1}$ denote the hyper-parameters associated with the CIoU loss and $l_1$ loss, respectively. $L_{cls}$ is the classification loss and $L_{reg}$ is the regression loss. In our practical implementation, we assign $\lambda_{CIoU} = 2$ and $\lambda_{l_1} = 5$, which are the default values in current popular tracking pipelines.
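The combined objective of Equations (6) and (7) can be sketched as follows; treating the per-sample weights λ_i as binary targets inside a standard BCE and using torchvision's CIoU loss are simplifications, and the sample-selection logic for the two branches is omitted.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss

def tracking_loss(pred_scores, targets, pred_boxes, gt_boxes,
                  lambda_ciou=2.0, lambda_l1=5.0):
    """Sketch of the training objective: binary cross-entropy for classification
    plus weighted CIoU and l1 terms for box regression."""
    # pred_scores / targets: (N,) predicted probabilities and float {0, 1} labels
    # pred_boxes / gt_boxes: (M, 4) positive-sample boxes in (x1, y1, x2, y2) format
    cls_loss = F.binary_cross_entropy(pred_scores, targets)
    ciou_loss = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l1_loss = F.l1_loss(pred_boxes, gt_boxes)
    return cls_loss + lambda_ciou * ciou_loss + lambda_l1 * l1_loss
```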

4. Experiments

This section begins by presenting the specifics of the training and inference processes alongside the dataset configurations, together with an overview of the evaluation metrics. Subsequently, a thorough analysis of the tracker’s performance is conducted on both UAV datasets [38] and GOT-10K [39], showcasing its efficacy compared to other SOTA trackers on UAVs. Finally, ablation experiments, visualizations, and a discussion of future work assess the influence of individual components and provide a qualitative analysis of our tracker’s advantages and disadvantages.
Training: Our tracker is trained using 8 Nvidia RTX 3090 GPUs, while inference is conducted on only one of them. The platform is Ubuntu 20.04.4 LTS, utilizing Python 3.10 and the PyTorch 2.4 framework. The CPU is an Intel i7-12700KF with 12 cores and 20 threads, paired with 128 GB RAM. The training process runs for 200 epochs, totaling approximately 60 h. During training, we employ a batch size of 36 and process 1500 images per epoch sourced from the COCO [40], GOT-10k [39], LaSOT [41], and TrackingNet [42] datasets. Notably, the maximum sampling interval within a video during training is capped at 100 frames. Optimization is conducted using AdamW with a step-decay learning rate schedule, decreasing from 1 × 10−5 to 1 × 10−6 on the tracking backbone and from 1 × 10−4 to 1 × 10−5 on the tracking neck by the 150th epoch. As for TrackingNet, only the first 4 data splits are used for training. The inference speed is 23 FPS (Frames Per Second), which meets the real-time requirement.
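A sketch of the optimizer and schedule described above is given below; the parameter-group layout, the weight-decay value, and the use of MultiStepLR to realize the step decline are assumptions.

```python
import torch

def build_optimizer(model):
    """Sketch of the optimizer setup: AdamW with separate learning rates for the
    backbone and the neck/head, decayed by a factor of 10 at the 150th epoch."""
    param_groups = [
        {"params": model.backbone.parameters(), "lr": 1e-5},  # tracking backbone
        {"params": model.neck.parameters(), "lr": 1e-4},      # tracking neck
        {"params": model.head.parameters(), "lr": 1e-4},      # tracking head (assumed to follow the neck)
    ]
    optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)  # weight-decay value assumed
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150], gamma=0.1)
    return optimizer, scheduler
```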
Dataset: The evaluation of our tracker heavily relies on the UAV123 dataset [38], which serves as a foundational dataset in our analysis. This dataset is composed of 123 UAV sequences, where 11,000 labeled UAV frames are used for the comprehensive evaluation. For each frame, there are 12 different object attributes, facilitating a complete analysis of the popular trackers. The dataset also encompasses challenges like clutters and partial occlusion in real-world UAV tracking scenarios.
Inference: The inference process follows the same tracking task as current trackers, where the features from the template branch remain unchanged. It is crucial for capturing the initial position and rectangles of the object. Transitioning to the search branch, the video sequence will be sent to this branch frame by frame. What is more, the center point, width, and height learned in this time step will serve as the center position to extract the subsequent search region in each frame.
Evaluation metrics: The evaluation of the UAV dataset strictly follows the one-pass evaluation (OPE) protocol, where the tracker is initialized only once for each sequence and cannot be reinitialized in case of tracking failures. The reported metrics are the average precision and success across all UAV sequences. Precision measures the percentage of frames successfully tracked, where the center location error remains below a predefined pixel threshold; this error is calculated as the distance between the predicted center position and the prelabeled ground truth. Success is defined by the intersection over union (IoU), $\frac{|R_o \cap R_l|}{|R_o \cup R_l|}$, between the prediction and the target object, where $R_o$ and $R_l$ denote the rectangles of the estimated boxes and the ground-truth bounding boxes, respectively. This metric quantifies the percentage of frames wherein the overlap ratio between the predicted tracking box and the actual bounding box exceeds a predetermined threshold, with the default threshold set at 0.5.
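For reference, the two OPE metrics can be computed per sequence roughly as follows; the 20-pixel precision threshold and the single-threshold success score (rather than the full success plot/AUC) are common conventions we assume here.

```python
import numpy as np

def precision_success(pred_boxes, gt_boxes, dist_thr=20, iou_thr=0.5):
    """Sketch of the OPE metrics for one sequence: precision from the center-location
    error and success from the IoU between predicted and ground-truth boxes."""
    # pred_boxes, gt_boxes: (N, 4) arrays in (x, y, w, h) format
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2          # predicted box centers
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2              # ground-truth box centers
    precision = np.mean(np.linalg.norm(pc - gc, axis=1) <= dist_thr)

    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    iou = inter / np.maximum(union, 1e-9)
    success = np.mean(iou >= iou_thr)
    return precision, success
```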

4.1. Tracker Comparison

In the tracker comparison section, our algorithm undergoes evaluation against current SOTA algorithms, including ECO [43], SiamAttn [44], ATOM [29], TransT [15], TrDiMP [45], TrSiam [45], SiamBAN [17], SiamITL, ParallelTracker, and SiamPT, among others. Our tracker demonstrates SOTA performance across the UAV dataset. Given that our model builds upon the dual-branch tracker baseline, we primarily conduct comparisons with this kind of tracker.
Illustrated in Figure 5, the overall performance assessment is conducted on the UAV123 dataset, measured by success and precision. It shows that our tracker with Macaron Attention achieves SOTA precision and success scores of 0.911 and 0.710, respectively. Compared with SiamPT and TransT, our tracker exhibits superior performance with precision and success surpassing 2.4% and 2.3% for SiamPT and 3.6% and 4.3% for TransT. These results prove the leading accuracy and robustness of our algorithm.

4.1.1. UAV123 Benchmark

In Table 1, we present a comprehensive UAV123 analysis with the following five metrics: success, precision, speed (inference), platform, and parameters. Our tracker demonstrates leading tracking performance with 25 FPS. Notably, the recent SiamEMT achieves 0.627 on success and 0.819 on precision with the parameters of 71.2 MB. For our tracker, it exhibits a remarkable performance increase with precision (0.911) and success (0.710), surpassing SiamEMT by 13.2% and 11.2% on success and precision, respectively.
Furthermore, compared to SiamPT, which has more parameters but a faster inference speed, our tracker still demonstrates a significant performance improvement, albeit with a slower inference speed. This indicates that although our method has a smaller parameter size, it incurs a higher computational cost than SiamPT, which slows down the inference.
Moreover, in contrast to ParallelTracker, which shares the same inference speed as our tracker (25 FPS), our tracker surpasses ParallelTracker by 2.6% and 0.6% in success and precision, respectively. Additionally, ParallelTracker has a substantially larger model size of 47.57 MB, exceeding our model by 25 MB. This comparative analysis underscores our tracker’s strong competitive ability, achieving remarkable performance at the same speed despite having a smaller model size.
In terms of parameter analysis, Table 1 depicts the parameter comparison where our tracker, equipped with Macaron Attention, has 22.3 M parameters. Specifically, the CNN backbone EfficientNet-V2, Transformer neck, and tracking head contribute 10.6 M, 11.33 M, and 0.19 M parameters, respectively. Comparatively, SiamPT employs the ConvNeXt-V1 backbone, tracking neck and head, with parameter counts of 12.33 M, 6.8 M, and 12.8 M, respectively. Notably, our tracker exhibits an increase in the parameter size of the tracking neck while decreasing the parameter size of the tracking head compared to SiamPT. However, it is worth noting that the complexity of the tracking neck may potentially impact the tracking speed.
In Figure 6, our tracker depicts its adeptness in effectively navigating through challenging scenarios involving similar objects, partial occlusions, background clutter, and changes in viewpoint. Particularly in scenarios with background clutter, our tracker demonstrates a significant performance lead over the second-best algorithm TrSiam, boasting a 10.7% higher success and an 8.4% higher precision relative to TrSiam. Similarly, when faced with partial occlusions, our tracker also outperforms TrSiam, exhibiting a 2.2% higher success and a 1.9% higher precision relative to TrSiam. In handling viewpoint changes, our tracker excels, achieving a 1.1% higher success and a 0.8% higher precision compared to TrSiam. Moreover, when dealing with similar objects, our tracker maintains a slight edge, surpassing competitors by 0.4% in success and 0.1% in precision relative to TrSiam.
Significantly, our tracker exhibits a substantial advantage in effectively handling background clutter and partial occlusions. While our tracker maintains its top position in scenarios involving similar objects and changes in viewpoint, the margin of this advantage is relatively modest. This is primarily attributed to the inherent complexity of addressing similar object challenges, which often necessitates the integration of additional modules into the tracking pipeline, such as KeepTrack [26]. However, among the evaluated trackers, TransT stands out for its classical Transformer-based dual-branch structure, showcasing acceptable performance. Despite the notable performance of TransT, our tracker significantly outperforms it across all attributes (UAV 12 different attributes).
As depicted in Figure 7, our tracker demonstrates competence in managing scenarios involving low resolution, illumination variation, out-of-view, and camera motion. Particularly noteworthy is our tracker’s performance in the out-of-view, illumination variation, low resolution, and camera motion scenarios, where it surpasses the second-best algorithm by 1.9%, 1.7%, 2.4%, and 1.4% in success, and by 0.4%, 0.9%, 2.7%, and 1.4% in precision, respectively.

4.1.2. UAV123 Benchmark

As illustrated in Figure 8, our tracker continues to exhibit leading performance in scenarios involving fast motion, scale variation, full occlusion, and aspect change (with only lower precision in fast motion). This reinforces our hypothesis regarding the efficacy of the Macaron Attention in handling scale variation and aspect change. Notably, our tracker performs well in scale variation, full occlusion, and aspect change, where it outperforms the second-best algorithm by 2.6%, 4.4%, and 2.9% in success, and by 2.7%, 3.1%, and 3.1% in precision, respectively.

4.1.3. UAV20L Benchmark

The UAV20L benchmark serves as an assessment dataset tailored for evaluating long-term tracking algorithms amidst intricate real-world UAV tracking environments. It comprises 20 extensive sequences averaging nearly 3000 frames per sequence. In our experiments conducted on the UAV20L benchmark, we systematically benchmark our algorithm against current trackers comprising SiamPT, SiamAPN++, SiamAPN, SiamRPN++, and SGDViT, ensuring a comprehensive evaluation. As illustrated in Table 2, our algorithm prominently emerges as the top performer, where the precision and success are 0.891 and 0.694, respectively.
In long-term tracking scenarios, scale variation poses substantial challenges, often leading to tracking drift. Notably, significant changes in scale can profoundly impact tracking performance. It therefore becomes imperative for a tracker to exhibit robust adaptability to handle scale variations and mitigate UAV perspective limitations effectively. To this end, our tracker integrates Macaron Attention, specifically tailored to bolster scale-handling capabilities, thereby attaining unparalleled performance on UAV20L.

4.1.4. GOT-10k Benchmark

GOT-10k [39] is a short-term tracking benchmark encompassing a diverse array of tracking scenarios. Although our model is tailored to tackle scale variation and UAV perspective limitations, we further assess it on this general dataset to substantiate our tracker’s performance.
For this evaluation, we leverage two fundamental metrics: Average Overlap (AO) and Success Rate (SR). The AO metric gauges the extent of overlap between the predicted bounding boxes and the ground-truth annotations, while the SR metric quantifies the percentage of frames where the overlap exceeds predefined thresholds (serving a similar role to success and precision). Notably, for a fair comparison, the training process relies exclusively on data from the GOT-10k dataset.
As demonstrated in Table 3, our algorithm presents commendable tracking performance, albeit with a slight decrease compared to SiamPT. Excluding SiamPT, our tracker achieves the highest AO and SR, with values of 0.718 and 0.817, respectively. It proves its robustness and efficacy in tracking diverse objects across challenging scenarios.

4.2. Visualization

As depicted in Figure 9, our tracker exhibits outstanding performance across a range of UAV video sequences. Notably, when confronted with scale variation, Macaron Attention dynamically combines local and global attention, thereby enhancing its capability to adapt to such fluctuations. This adaptability proves crucial in effectively handling persistent scale variations. Moreover, when faced with challenging UAV perspectives, our tracker excels in accurately tracking such viewpoints. The integrated fusion of the FWA, LSGA, and CGA modules within the Macaron Attention mechanism empowers it to effectively tackle the diverse challenges encountered in UAV tracking scenarios.

4.3. Limitation and Future Expectancy

Although our MATrack has demonstrated enhanced tracking performance, there are some drawbacks worth noting. Similar to SiamPT, one of the most notable shortcomings of our tracker is that it does not include an extra template-updating module. Moreover, we recognize the potential benefits of incorporating a squeezed attention mechanism, particularly for tracking smaller objects such as UAVs. However, our current tracker’s inference speed is lower than that of SiamPT, which should be addressed in the future. By enhancing the model’s ability to attend to objects at different scales, we anticipate improved performance while maintaining a fast inference speed.

4.4. Ablation Study

In Table 4, we display the influence of the different components of MATrack. It is noteworthy that our baseline model is TransT, indicating that without FWA and LSGA, our model reduces to a TransT structure. With FWA alone, the model can be seen as a simplified version of the Swin Transformer, excluding the window shifting operation, which limits the tracking performance. When introducing LSGA, applying this module alone to the tracking task results in a noticeable performance increase, as it already accounts for the combination of local and global attention. Additionally, combining FWA and LSGA leads to a significant overall performance improvement, as it forms an attention model with a progressively expanding receptive field. Furthermore, the number of CGA layers impacts tracking performance, denoted as $CGA_{l1}$ to $CGA_{l3}$, representing the use of 1 to 3 layers. For instance, $CGA_{l3}$ indicates that three CGAs in series are utilized in our model, where the output from LSGA undergoes three sequential CGA operations. The results indicate that utilizing two layers of CGA yields the best tracking performance.

5. Conclusions

In this study, we introduce a novel Macaron Attention mechanism integrated into the tracking pipeline to address the challenges posed by perspective constraints and scale variations encountered in Unmanned Aerial Vehicle (UAV) videos. Our proposed approach presents a progressive attention mechanism transitioning from local to global, termed the “squeezing” method. This method comprises three key components: fixed window attention (FWA), local squeezing global attention (LSGA), and conventional global attention (CGA). The FWA component serves as the fundamental local attention realization. However, its fixed window design may struggle to accommodate the scale variations of objects. To address this limitation, we employ the LSGA module, which adapts to scale variations through an adaptive clustering-based token aggregation strategy and incorporates a “window-to-window” alignment approach to seamlessly integrate global and local attention. Furthermore, the LSGA module efficiently converts attention outcomes back into the local attention format to conserve computational resources. Finally, the CGA module is utilized to preserve the Transformer’s original global modeling capabilities. Through the integration of Macaron Attention, our tracker demonstrates state-of-the-art performance and maintains acceptable inference speeds on UAV tracking benchmarks.

Author Contributions

Conceptualization, Z.W.; methodology, Z.W. and M.L.; software, Z.W. and D.L.; validation, Z.W.; formal analysis, Z.W.; investigation, Z.W.; resources, M.L. and H.L.; data curation, M.L. and Y.L.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W.; visualization, Z.W.; supervision, J.Z.; project administration, Q.B. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China under Grant 62101529 of Meihui Li and the Postdoctoral Fellowship Program of CPSF GZC20232676 of Dongxu Liu.

Data Availability Statement

The data are available from the corresponding author upon request. The data are not publicly available because an extended version of this work is planned.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  2. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
  3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  4. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
  5. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980. [Google Scholar]
  6. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  7. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556. [Google Scholar]
  8. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754. [Google Scholar]
  9. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  10. Xu, Q.; Deng, H.; Zhang, Z.; Liu, Y.; Ruan, X.; Liu, G. A ConvNeXt-based and feature enhancement anchor-free Siamese network for visual tracking. Electronics 2022, 11, 2381. [Google Scholar] [CrossRef]
  11. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning. PMLR, Online, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  12. Wang, Z.; Yao, J.; Tang, C.; Zhang, J.; Bao, Q.; Peng, Z. Information-diffused graph tracking with linear complexity. Pattern Recognit. 2023, 143, 109809. [Google Scholar] [CrossRef]
  13. Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600. [Google Scholar]
  14. Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight Aware Enhancement Transformer and Multiple Matching Network for Real-Time UAV Tracking. Remote Sens. 2023, 15, 2857. [Google Scholar] [CrossRef]
  15. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  16. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 146–164. [Google Scholar]
  17. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6668–6677. [Google Scholar]
  18. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6269–6277. [Google Scholar]
  19. Zheng, Z.; Wan, Y.; Zhang, Y.; Xiang, S.; Peng, D.; Zhang, B. CLNet: Cross-layer convolutional neural network for change detection in optical remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 247–267. [Google Scholar] [CrossRef]
  20. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
  21. Chen, B.; Li, P.; Bai, L.; Qiao, L.; Shen, Q.; Li, B.; Gan, W.; Wu, W.; Ouyang, W. Backbone is all your need: A simplified architecture for visual object tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 375–392. [Google Scholar]
  22. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  24. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
  25. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 771–787. [Google Scholar]
  26. Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 13444–13454. [Google Scholar]
  27. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
  28. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
  29. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
  30. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
  31. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8731–8740. [Google Scholar]
  32. Fu, Z.; Fu, Z.; Liu, Q.; Cai, W.; Wang, Y. SparseTT: Visual Tracking with Sparse Transformers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 905–912. [Google Scholar]
  33. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  34. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357. [Google Scholar]
  35. Wang, Z.; Zhou, G.; Yao, J.; Zhang, J.; Bao, Q.; Hu, Q. Self-Prompting Tracking: A Fast and Efficient Tracking Pipeline for UAV Videos. Remote Sens. 2024, 16, 748. [Google Scholar] [CrossRef]
  36. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and better learning for bounding box regression. arXiv 2020, arXiv:1911.08287. [Google Scholar] [CrossRef]
  38. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461. [Google Scholar]
  39. Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  41. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
  42. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
  43. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  44. Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6728–6737. [Google Scholar]
  45. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580. [Google Scholar]
Figure 1. The overview diagram of the dual-stream tracking pipeline integrated with Macaron Attention. It comprises a tracking backbone, a Macaron tracking neck, and a tracking head. Macaron Attention is specifically designed to address the scale variation and limited perspective of UAV targets, using local squeezing global attention to tackle these issues effectively.
Figure 2. The overview of fixed window attention (FWA). It partitions the feature tokens into non-overlapping small patches and implements attention within each local region. The red rectangles denote the non-overlapping partition, the circles with black rectangles denote the tokenized features, and the red dashed lines represent the combination of tokenized features.
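To make the FWA operation in Figure 2 concrete, the following is a minimal PyTorch sketch of window-partitioned self-attention as described in the caption. The class name, default window size, and the use of nn.MultiheadAttention are our illustrative choices and should not be read as the authors' released implementation.

import torch
import torch.nn as nn

class FixedWindowAttention(nn.Module):
    """Sketch: self-attention restricted to non-overlapping spatial windows."""
    def __init__(self, dim, window_size=4, num_heads=4):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature tokens on a spatial grid; H and W divisible by ws
        B, H, W, C = x.shape
        ws = self.ws
        # partition into non-overlapping ws x ws windows
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # attention is computed within each window only
        x, _ = self.attn(x, x, x)
        # merge the windows back onto the spatial grid
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return x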
Figure 3. The overview of LSGA. Local attention and global attention are directly combined to realize efficient information interaction. It comprises the cluster-finding block, the local–global squeezing block, and the resetting block. The circles with rectangles denote feature tokens, as in Swin and CSwin.
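Figure 3 describes LSGA only at a block-diagram level, so the PyTorch sketch below is an approximation rather than the paper's exact formulation: the cluster-finding block is replaced here by average pooling that squeezes the feature map into a small set of global tokens, the local–global squeezing block lets each window attend jointly to its own tokens and to these global tokens, and the resetting block scatters the window tokens back onto the spatial grid. All names, defaults, and the pooling-based clustering are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSqueezingGlobalAttention(nn.Module):
    """Sketch: local window tokens attend to themselves plus squeezed global tokens."""
    def __init__(self, dim, window_size=4, num_global=16, num_heads=4):
        super().__init__()
        self.ws = window_size
        self.ng = num_global  # assumed to be a perfect square in this sketch
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature tokens; H and W divisible by the window size
        B, H, W, C = x.shape
        ws = self.ws
        # "cluster-finding" (approximated): squeeze the map into ng global tokens
        side = int(self.ng ** 0.5)
        g = F.adaptive_avg_pool2d(x.permute(0, 3, 1, 2), side)  # (B, C, side, side)
        g = g.flatten(2).transpose(1, 2)                         # (B, ng, C)
        # partition into non-overlapping local windows, as in FWA
        win = x.reshape(B, H // ws, ws, W // ws, ws, C)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, ws * ws, C)
        n_win = win.shape[1]
        # "local-global squeezing": each window attends to its own tokens + global tokens
        q = win.reshape(B * n_win, ws * ws, C)
        kv = torch.cat([q, g.repeat_interleave(n_win, dim=0)], dim=1)
        out, _ = self.attn(q, kv, kv)
        # "resetting": scatter the window tokens back onto the spatial grid
        out = out.reshape(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return out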
Figure 4. The overview of conventional global attention. Only global attention is used to enrich the global information; the query, key, and value all span the entire set of feature tokens.
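For completeness, the conventional global attention depicted in Figure 4 is ordinary multi-head self-attention in which the query, key, and value all cover every feature token. A minimal sketch follows; the module name and head count are illustrative.

import torch.nn as nn

class ConventionalGlobalAttention(nn.Module):
    """Sketch: standard multi-head self-attention over the full token set."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, C) flattened feature tokens; every token attends to every token
        out, _ = self.attn(x, x, x)
        return out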
Figure 5. The overall performance comparison of our tracker in terms of success and precision.
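The success and precision scores reported in Figures 5–8 follow the standard one-pass evaluation protocol used for UAV123/OTB-style benchmarks. The snippet below is our own sketch of those definitions (area under the overlap-threshold curve for success, and the fraction of frames whose center location error is within 20 pixels for precision); it is not the authors' evaluation script.

import numpy as np

def success_and_precision(ious, center_errors):
    # ious, center_errors: per-frame overlap and center location error (pixels)
    thresholds = np.linspace(0.0, 1.0, 21)
    success_curve = [(ious > t).mean() for t in thresholds]
    success = float(np.mean(success_curve))          # AUC of the success plot
    precision = float((center_errors <= 20).mean())  # 20-pixel precision threshold
    return success, precision

# illustrative per-frame values
ious = np.array([0.82, 0.75, 0.10, 0.64])
errs = np.array([3.0, 5.5, 42.0, 8.1])
print(success_and_precision(ious, errs))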
Figure 6. Precision and success on the UAV123 dataset for the similar object, partial occlusion, background clutter, and viewpoint change attributes.
Figure 7. Precision and success on the UAV123 dataset for the out-of-view, illumination variation, low resolution, and camera motion attributes.
Figure 8. Precision and success for the fast motion, scale variation, full occlusion, and aspect change attributes.
Figure 9. Qualitative analysis through visualization: a comparison of the predicted bounding boxes of our tracker, TransT, TrDiMP, SiamCAR, and ATOM.
Table 1. The comprehensive performance evaluation includes a comparison with the current UAV-benchmark trackers, encompassing aspects such as inference speed, applied platform, and parameters. Mt and Md denote the metrics and methods.
Mt \ Md                  SiamITL      SiamEMT      ParallelTracker      SiamPT       Ours
Success                  62.5         62.7         69.2                 69.4         71.0
Precision                81.8         81.9         90.5                 89.0         91.1
Inference Speed (FPS)    19325309125
Platform                 RTX3090      RTX3090      RTX2070              RTX3090      RTX3090
Parameters               65.4 M       71.2 M       47.6 M               32.8 M       22.3 M
Table 2. The comparison conducted on the UAV20L dataset. Mt and Md denote the metrics and methods.
Mt \ Md      SGDViT      SiamRPN++      SiamAPN      SiamAPN++      SiamPT      Ours
Success      51.9        57.9           51.8         53.3           65.3        69.4
Precision    69.2        75.8           69.2         70.3           84.8        89.1
Table 3. The comparison conducted on the GOT-10k benchmark. Mt and Md denote the metrics and methods.
Mt \ Md      AutoMatch      SBT      SLT      STARK      TransT      OSTrack      SiamPT      Ours
AO           65.2           70.4     67.5     68.8       67.1        71.0         72.5        71.8
SR_0.5       76.6           80.8     76.5     78.1       76.8        80.4         82.7        81.7
SR_0.75      54.3           64.7     60.3     64.1       60.9        68.2         67.0        69.2
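For readers unfamiliar with the GOT-10k metrics in Table 3, the snippet below restates the standard definitions, where AO is the average per-frame overlap and SR_t is the fraction of frames whose overlap exceeds the threshold t. It is an illustrative sketch, not the official GOT-10k evaluation toolkit.

import numpy as np

def got10k_metrics(ious):
    # ious: per-frame overlaps between predicted and ground-truth boxes
    ao = float(np.mean(ious))            # average overlap (AO)
    sr_50 = float((ious > 0.5).mean())   # success rate at overlap > 0.5
    sr_75 = float((ious > 0.75).mean())  # success rate at overlap > 0.75
    return ao, sr_50, sr_75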
Table 4. Success rate (SR) and inference speed in the ablation study. FWA means fixed window attention, LSGA means local squeezing global attention, and CGA_l1 to CGA_l3 mean one to three layers of conventional global attention.
No.    FWA    LSGA    CGA_l1    CGA_l2    CGA_l3    SR      Inference Speed (FPS)
1                                                   68.2    49.8
2                                                   69.4    27.9
3                                                   70.2    25.2
4                                                   71.0    23.1
5                                                   70.3    21.0
✓ means used and ✕ means unused.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
