Article

Self-Prompting Tracking: A Fast and Efficient Tracking Pipeline for UAV Videos

1 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
2 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
3 Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Chengdu 610209, China
4 School of Electronic, Electrical and Communication Engineering, Chinese Academy of Sciences, Beijing 100049, China
5 National Key Laboratory of Optical Field Manipulation Science and Technology, Chinese Academy of Sciences, Chengdu 610209, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(5), 748; https://doi.org/10.3390/rs16050748
Submission received: 28 December 2023 / Revised: 1 February 2024 / Accepted: 8 February 2024 / Published: 21 February 2024
Figure 1. Comparison of different tracking pipelines of Transformer-based trackers. (a) is the conventional tracking pipeline. (b) is the novel prompting tracking pipeline (CVPR2023). (c) is our proposed tracking pipeline with a self-prompting mechanism.
Figure 2. The overview of SiamPT with the CNN backbone, Transformer neck, and Tracker head stages. An efficient ConvNeXt backbone is used as the foundation of our tracker. Within the Neck block, the Prompter Generation Module (PGM) is responsible for the extraction of prompters from the global attention mechanism. The Feature Division Module (FDM) is to categorize tokens into different classifications, ensuring the ability to distinguish targets and background interference.
Figure 3. The overview of the Transformer Neck. It includes the details of our proposed FDM and PGM. The different colors denote the different clustering regions in the feature map.
Figure 4. The overview of the attention structure in SiamPT, where the Multi-Head Attention (MHA), Feed-Forward Network (FFN), Feature Division Module (FDM), and Prompter Generation Module (PGM) are involved.
Figure 5. The process to generate the prompter. The local attention and global attention can guide each other to form a prompter. Similarly, the different colors denote the different clustering regions in the feature map.
Figure 6. Comparison on the UAV123 dataset in overall performance. It indicates the success rate and precision of our proposed SiamPT.
Figure 7. Comparison on the UAV123 dataset in different attributes. Camera Motion, Aspect Ratio Change, and Full Occlusion are included.
Figure 8. Comparison on the UAV123 dataset in different attributes. Scale Variation, Viewpoint Change, and Low Resolution are included.
Figure 9. Comparison on the UAV123 dataset in different attributes. Illumination Variation, Fast Motion, and Out-of-View are included. Notably, our model is not in the leading position in these attributes.
Figure 10. The visualization outcomes conducted on the UAV123 dataset.

Abstract

In the realm of visual tracking, remote sensing videos captured by Unmanned Aerial Vehicles (UAVs) have seen significant advancements with wide applications. However, conventional Transformer-based trackers still struggle to balance tracking accuracy and inference speed. This problem is further exacerbated when Transformers are implemented at larger model scales. To address this challenge, we present a fast and efficient UAV tracking framework, denoted as SiamPT, which aims to reduce the number of Transformer layers without losing the discriminative ability of the model. To realize it, we transfer conventional prompting theories from multi-modal tracking to UAV tracking, where a novel self-prompting method is proposed by utilizing the target’s inherent characteristics in the search branch to discriminate targets from the background. Specifically, a self-distribution strategy is introduced to capture feature-level relationships, which segments tokens into distinct smaller patches. Subsequently, salient tokens within the full attention map are identified as foreground targets, enabling the fusion of local region information. These fused tokens serve as prompters to enhance the identification of distractors, thereby avoiding the demand for model expansion. SiamPT has demonstrated impressive results on the UAV123 benchmark, achieving success and precision rates of 0.694 and 0.890, respectively, while maintaining an inference speed of 91.0 FPS.

1. Introduction

In the field of remote sensing videos, ensuring robust tracking of Unmanned Aerial Vehicles (UAVs) is of vital significance, given its extensive applicability across diverse domains, including human-computer interaction, autonomous navigation, and military operations [1]. The fundamental objective of UAV tracking is to predict the target’s position in subsequent frames, using only the information available in the initial frame. While challenges such as target scale variations, occlusion, and similar interference have been progressively addressed by current trackers, the increase in model scale and complexity continues to impede the trade-off between tracking accuracy and inference speed. Therefore, enhancing tracking performance while minimizing model cost remains a significant challenge.
To tackle the aforementioned challenges in UAV tracking, powerful backbones have been applied to Siamese trackers, where AlexNet [2] and ResNet [3] were utilized for feature extraction in SiamFC [4] and SiamRPN [5]. Furthermore, information integration strategies were also explored by constructing interaction mechanisms between the target and the search region. Notably, trackers like SiamFC++ [6], SiamBAN [7], OCEAN [8], and SiamCAR [9] directly predicted the targets’ bounding boxes without anchors, achieving more accurate localization performance. Online trackers like DCF [10], KCF [11], ATOM [12], and DiMP [13] could dynamically model the variance of target appearance through online learning. Attention mechanisms have also proven effective in tracking tasks by focusing on the most relevant features to model the target. SiamAttn [14] effectively captured contextual information and enhanced the interplay among features by incorporating deformable self-attention and cross-attention. Nonetheless, those approaches do not substantially explore the attention mechanism’s global modeling capabilities, limiting the representation ability for UAV targets.
Recently, the growing interest in Transformer-based models can be attributed to their superior performance. In tracking tasks, the attention mechanism of the Transformer facilitates capturing long-range dependencies, which bolsters information integration among UAV targets. TransT, STARK, and IDGtrack [15,16,17] incorporated Transformers to globally model information interaction, markedly enhancing performance. In particular, AiATrack [18] extended the attention module’s capacity and applied an Attention in Attention strategy on the correlation map. MixFormer [19] introduced an end-to-end Transformer architecture (Mixed Attention Module) instead of dividing the attention branch. IDGtrack [17] designed a novel linear attention integrated into a graph model, achieving fast inference speed and acceptable performance. Zhu et al. introduced ViPT [20], which incorporates supplementary auxiliary inputs into the prompt-tuning process, enabling the base model (ViT) [21] to learn the interrelation between different modalities. However, despite these approaches substantially exploring the attention mechanism’s global modeling capabilities, current trackers are still limited in their ability to balance tracking accuracy and inference speed. The establishment of a tracker with fast inference speed and high accuracy remains an ongoing challenge in UAV tracking.
In detail, current trackers, such as TransT, have expanded their parameter scales by integrating an additional four Transformer layers into the Neck block. Additionally, the ViT-based tracker exceeds 86 million parameters, even in its minimal configuration. In response, we present a novel self-prompting algorithm, denoted as SiamPT, integrated into the tracking pipeline. This algorithm is designed to avoid excessive growth in the tracker’s scale while enhancing its discriminative capacity in distinguishing Unmanned Aerial Vehicles (UAVs) from background elements. We transfer principles from multi-modal tracking theory to the domain of UAV tracking, aiming to reduce the number of Transformer blocks while enhancing the tracker’s discriminative capabilities at a modest cost. Our strategy begins by fixing the number of Transformer blocks to two, effectively mitigating parameter expansion. Subsequently, a self-prompting technique, characterized by significantly fewer parameters than the Transformer blocks, is devised to identify salient features as prompters. Notably, only the model’s inherent characteristics are employed to generate these prompters, establishing a single-branch prompting pipeline. Moreover, we design a novel method to integrate prompters and global attention, which enhances spatial and channel-distributed information, thereby advancing tracking performance.
We substantiate the effectiveness of our proposed methodology on UAV123 and UAV20L tracking benchmarks for UAV tracking tasks and GOT-10K for general tracking tasks. Our innovative approach, SiamPT, not only attains state-of-the-art performance across these benchmarks but also demonstrates a superior inference speed of approximately 91.0 FPS.
Our major contributions are summarized below:
  • We propose a fast and efficient UAV tracking framework with a self-prompting algorithm, effectively striking a balance between tracking speed and accuracy. To the best of our knowledge, this represents the first effort to define a prompter exclusively based on single-branch features.
  • We introduce an innovative division strategy distinguishing the source of the prompter and the source of prompted features. The global attention mechanisms serve as the source of the prompter, while the local attention mechanisms act as the prompted features. This approach introduces a novel paradigm for the fusion of local and global information in a rapid fashion.
  • SiamPT undergoes comprehensive evaluation on well-established UAV tracking benchmarks: UAV123 and UAV20L. Furthermore, we validate its performance on the general dataset, GOT-10K. SiamPT not only achieves state-of-the-art results on these benchmarks but also excels in terms of rapid inference in the tracking domain.

2. Related Works

Siamese trackers. Current Siamese-based trackers are primarily driven by the seminal work of SiamFC [4]. SiamFC introduced a Siamese architecture with a correlation operation and multi-scale prediction. This breakthrough paved the way for subsequent algorithms focused on integrating information, refining RPN regions, and incorporating diverse supervision mechanisms. For instance, SiamRPN [5] extended the task from target localization to target regression by incorporating the region proposal network (RPN) [22]. Building on these advances, SiamRPN++ [23] explored a deep backbone with a selective sampling strategy to improve translation consistency. SiamSTM [24] adopted multiple matching networks to improve tracking robustness in UAV videos. TRTrack [25] utilized a trajectory-aware pre-training method on UAV sequences to enhance long-term modeling ability. STMtrack [26] introduced a novel template updating mechanism to adapt to target variations, while DaSiamRPN [27] improved training data quality to enhance generalization. For anchor-free structures, SiamBAN, SiamCAR, and SiamFC++ [6,7,9] removed the traditional RPN pipeline and directly predicted the differences between the bounding boxes and learnable boxes. Attention mechanisms used in SiamAttn and SiamGAT [14,28] enhanced global information interaction.
Temporal trackers. KCF [11] and SRDCF [29] motivated the development of online temporal trackers, which iteratively update the target position with minimal loss between frames. DiMP [13] innovatively integrated concepts from them to tackle the issue of accurately modeling target variations. Building upon this, PrDiMP [30] introduced the variational principle into tracking tasks to approximate Gaussian labels and eliminate distractors. ATOM [12] incorporated an IoU-Net [31] into the tracker, significantly improving its discriminating ability during online training. KeepTrack [32] followed the concept of multi-object tracking to form an association map that continually predicts targets’ trajectories to discriminate distractors.
Transformer works. Transformer-based models have emerged as powerful tools for target tracking, demonstrating remarkable capabilities in various computer vision applications. These models use the multi-head attention mechanism, which utilizes queries, keys, and values to establish a comprehensive global perspective, allowing the model to capture information from the entire feature map. Building on the foundation of DETR [33], TransT [15] introduced innovative modifications, including the ego-context augmentation (ECA) and cross-feature augmentation (CFA) modules. These enhancements enable self-improvement in feature extraction and the interaction between different features. In addition to TransT, such a pipeline has emerged in TrDiMP [34], which additionally incorporates the multi-head attention mechanism into a temporal model to handle template variations. ToMP [35] built on the Discriminative Correlation Filter (DCF) paradigm to dynamically capture temporal variations, while STARK [16] focused on concatenating template and search branch features and employing a random initial box to approximate bounding boxes. OSTrack [36] emphasized the enhancement of inference speed and the reduction of background interference through the elimination of insignificant sub-patches. SparseTT [37] employed the Top-K algorithm to discard interference by ranking the weights, and IDGtrack [17] pioneered the integration of linear attention into tracking tasks via a graph model to accelerate inference speed. Despite these advancements, conventional Transformer-based trackers still face challenges in striking the proper balance between tracking accuracy and inference speed. To address this issue, our proposed method introduces a novel self-prompter integrated into the attention mechanism, offering an accessible solution to these challenges. This innovation leads to significant improvements in tracking performance and achieves the desired balance between accuracy and speed in tracking tasks.

3. Preliminary

As shown in Figure 1, the conventional tracking pipeline (a) defines tracking tasks as $\mathcal{T}: \{X_{TS}, B_{initial}\} \rightarrow B$. This formulation essentially signifies that the primary objective of the tracker is to predict the bounding box $B$ based on the initial box $B_{initial}$, where $X_{TS}$ denotes the Template tokens and Search tokens in two individual branches. Such a tracking pipeline may manifest as a single-branch or double-branch realization, involving the concatenation of template and search tokens, but it lacks efficient information interaction between these branches, especially in scenarios specific to remote sensing videos such as UAV tracking.
To address this challenge for specific-domain tracking, ViPT [20] (the prompting tracking pipeline (b)) introduced an additional spatially synchronized input flow and extended the model input to $\{X_{TS}, X_{mul}\}$, where $X_{mul}$ represents other auxiliary modalities and $X_{TS}$ denotes the inputs of the two branches. Consequently, the specific-domain tracking task can be formulated as $\mathcal{T}_{mul}: \{X_{TS}, X_{mul}, B_{initial}\} \rightarrow B$, where $\mathcal{T}_{mul}$ denotes the specific-domain tracker.
Typically, the tracker $\mathcal{T}$ can be decomposed into $\varpi \oplus \phi$, where $\varpi: \{X_{TS}, B_{initial}\} \rightarrow F_{TS}$ and $F_{TS}$ is the extracted feature. This part encompasses the CNN backbone and Transformer neck, which perform feature extraction. On the other hand, the tracker head $\phi: F_{TS} \rightarrow B$ is responsible for estimating the predicted bounding boxes. Here, $\oplus$ denotes the processing order. In our method, shown as (c) in Figure 1, the search tokens $X_{S}$ are divided into sub-groups, represented as $X_{SG}$, and are meanwhile transformed by the global attention to extract the self-generated prompters $X_{sp}$. This prompting process can then be denoted as $\mathcal{T}_{mul}: \{X_{SG}, X_{sp}, B_{initial}\} \rightarrow B$.

4. Methods

The overview of our SiamPT is illustrated in Figure 2. Our model comprises a CNN backbone, a Transformer neck module, and a tracker head. Within this neck module, we have intricately designed the Feature Division Module (FDM) and the Prompter Generation Module (PGM). These components are used to generate prompters that bolster the discriminative capabilities of our model, particularly in the context of sub-patch learning. Notably, the combination of Multi-Head Cross Attention (MHCA), FDM, and PGM repeats only twice in the Transformer Neck. This design greatly elevates both the performance and the inference speed.

4.1. CNN-Based Backbone

To fulfill the requisites of a lightweight tracking network, our method adopts traditional CNN-based backbones, as opposed to transformer structures. The pre-trained parameters within the CNN network assume a critical role, and for a valid comparative analysis, we do not utilize the MAE pre-trained versions. Current backbone strategies essentially fall into two categories: one relies on convolution and pooling layers to extract deep features, wherein the spatial context remains limited within a predefined window and grows larger with an increase in the number of layers; the other employs a transformer structure to extract comprehensive target features, with each layer obtaining a global spatial context.
Convolutional neural networks (CNNs) are currently regarded as limited in effectively modeling long-range dependencies, due to their locally shared convolution kernels. As a solution, the Swin Transformer [38], a pure Transformer backbone, is increasingly adopted in tracking tasks. Nonetheless, the remarkable performance of the Swin Transformer comes at the cost of a considerable computational burden, as the dot-product operation combined with softmax entails quadratic computational complexity relative to the sequence length ($O(N^2)$).
Considering training and inference efficiency, we opt for ConvNeXt-V1 [39] as our feature extraction backbone, omitting the last two layers. This choice carries a lower model complexity compared to the Swin Transformer. In both the template and search branches, we employ weight-shared convolution layers. The features extracted from the fourth and fifth stages of ConvNeXt-V1 have channel dimensions of $\mathbb{R}^{N \times 192}$ and $\mathbb{R}^{N \times 384}$, respectively. To concatenate these representations, we first apply a $1 \times 1$ convolution operation that unifies the number of channels to $\mathbb{R}^{N \times 256}$. Subsequently, a straightforward fusion operation is deployed to amalgamate the features extracted from the fourth and fifth stages. This integration is achieved through a simple weighted addition, allowing the efficient blending of features from these different stages.
$$f(t) = \alpha \, \mathrm{Conv}(f_{t1}(t)) + (1 - \alpha) \, \mathrm{Conv}(f_{t2}(t))$$
$$f(s) = \beta \, \mathrm{Conv}(f_{s1}(s)) + (1 - \beta) \, \mathrm{Conv}(f_{s2}(s))$$
where $\mathrm{Conv}(\cdot)$ denotes the $1 \times 1$ convolution layer, and the functions $f_{t1,t2}(\cdot)$ and $f_{s1,s2}(\cdot)$ denote the outputs of the last two stages of the template and search branches, respectively. $\alpha$ and $\beta$ are trainable parameters in our model. The outcomes are then sent to the MLP block to map the features into different feature spaces.
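For illustration, the following is a minimal sketch of this weighted two-stage fusion in PyTorch. The module name `StageFusion`, the default channel sizes, and the assumption that both stage outputs are already at the same spatial resolution are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StageFusion(nn.Module):
    """Fuse two backbone stages with 1x1 convolutions and a trainable mixing weight."""
    def __init__(self, c4=192, c5=384, c_out=256):
        super().__init__()
        self.conv4 = nn.Conv2d(c4, c_out, kernel_size=1)   # unify stage-4 channels to 256
        self.conv5 = nn.Conv2d(c5, c_out, kernel_size=1)   # unify stage-5 channels to 256
        self.alpha = nn.Parameter(torch.tensor(0.5))        # trainable weight (alpha or beta)

    def forward(self, f4, f5):
        # assumes f4 and f5 share the same spatial resolution
        return self.alpha * self.conv4(f4) + (1 - self.alpha) * self.conv5(f5)
```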

4.2. Self-Prompting Realization

In the implementation of the SiamPT pipeline, the primary innovative structure is the Transformer Neck. As illustrated in Figure 3, the bottom branch of our architecture utilizes global attention to generate prompters through Top-K modulation, complemented by guidance from the sub-patch branch. The prompter is concatenated with the tokens within each sub-patch, boosting the model’s discriminative capacity. These enhancements collectively improve the performance of our tracker while maintaining an exceptional inference speed.

4.2.1. Encoder-Decoder Pipeline

As depicted in Figure 4, our approach adopts an architecture similar to DETR. In this design, the template branch, equipped with Multi-Head Attention (MHA), functions as the encoder, while the search branch, incorporating Multi-Head Cross Attention (MHCA), the Prompter Generation Module (PGM), and the Feature Division Module (FDM), acts as the decoder. In the attention architecture of SiamPT, the Feed-Forward Network (FFN) consists of two fully connected layers. The first layer incorporates a Rectified Linear Unit (ReLU) activation function to introduce nonlinear transformations. To ensure a stable training process and mitigate the risk of gradient vanishing, residual connections and layer normalization are employed. For position encoding, the standard sine-based method is employed to specify the attention position. Notably, the number of Transformer layers N is set to 2 to save computation cost and model parameters.
For MHA, it involves calculating similarities between queries, keys, and values. This process is mathematically expressed as:
$$\mathrm{MultiAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ represent the query, key, and value, respectively, and $\sqrt{d_k}$ is the scaling factor given by the key dimension.
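As a reference, a minimal sketch of this scaled dot-product attention is shown below; the (batch, tokens, channels) tensor layout is an assumption for illustration.

```python
import torch

def multi_attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V have shape (batch, tokens, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity map (batch, N_q, N_k)
    return torch.softmax(scores, dim=-1) @ V        # attention-weighted sum of values
```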

4.2.2. Feature Division Module (FDM)

In the context of Unmanned Aerial Vehicle (UAV) tracking, the key challenges are primarily twofold. The first is the limited perspective offered by UAVs; the second is the demand for a lightweight tracking pipeline that can operate at high speed. Regarding the first challenge, the constrained perspective of UAVs often introduces background interference, especially in the presence of similar targets and occlusions. This problem is further exacerbated when dealing with small objects in the aerial view, which intensifies the interference. These scenarios are particularly challenging for global attention mechanisms, which tend to overlook the local attention of UAV targets. Consequently, undesired background information might dominate the tracking process.
To tackle these challenges, we introduce the Feature Division Module (FDM), which provides a fine-grained, local attention view. This enhancement has proven important in fortifying the algorithm’s robustness, particularly by efficiently filtering out distractors. Simultaneously, because of its reduced computation cost, it contributes significantly to the inference speed, aligning the tracking process with the speed requirements of UAV tracking.
In implementing the Feature Division Module (FDM), an essential aspect involves the equitable division of input vectors into two distinct groups across channels. Within this division process, at the k-th hierarchical level, each subset is allocated a share of $N/2^{k-1}$ tokens. The partitioning of these tokens adheres to the Multi-Head Attention (MHA) criterion, as depicted below:
$$Q_d(s),\, K_d(s) = \mathrm{Norm}\big(\mathrm{Mean}\big(\mathrm{ChannelDivision}(f_d(s))\big)\big)$$
where $\mathrm{ChannelDivision}(\cdot)$ is the channel-averaging operation and $\mathrm{Norm}$ is the normalization operation along the channel dimension. $Q_d(s)$ and $K_d(s)$ are the divided groups, and $f_d(s)$ represents the input tokens. $Q_d(s)$ and $K_d(s)$ are then concatenated to form $F_d(s)$ ($\{Q_d(s), K_d(s)\} \rightarrow F_d(s)$) for the subsequent attention operation:
$$Rank_s = \mathrm{Sort}\big(\big(f_d(s)\,(F_d(s))^{T}\big)\, f_d(s)\big)$$
where $Rank_s$ represents the similarity to the two chosen initial groups and $\mathrm{Sort}(\cdot)$ is the ranking operation over the values. By repeating this process for $N$ iterations, $Rank_s$ can be divided into $N/2^{k-1}$ sequences that indicate the dividing positions.
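The following is a loose, hedged sketch of this recursive division, assuming the ranking reduces to scoring each token against two channel-averaged descriptors and splitting the sorted tokens in half; the helper name `divide_tokens` and the exact scoring are our simplifications, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def divide_tokens(tokens, levels=3):
    """tokens: (N, C). Recursively split token indices into 2**(levels-1) groups."""
    groups = [torch.arange(tokens.size(0))]
    for _ in range(levels - 1):
        next_groups = []
        for idx in groups:
            f = tokens[idx]                                   # (n, C) tokens of this group
            half = f.size(1) // 2
            # channel division + mean + normalization, then rank tokens by the difference
            q_d = F.normalize(f[:, :half].mean(dim=1), dim=0)
            k_d = F.normalize(f[:, half:].mean(dim=1), dim=0)
            order = torch.argsort(q_d - k_d, descending=True)
            mid = idx.size(0) // 2
            next_groups += [idx[order[:mid]], idx[order[mid:]]]
        groups = next_groups
    return groups
```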

4.2.3. Prompter Generation Module (PGM)

For the PGM, as illustrated in Figure 5, we use the attention mechanism previously defined in the initial tracking prompting algorithm to generate the prompters. The fundamental challenge here lies in how to extract valuable information from this global attention to facilitate the divided groups within the Feature Division Module (FDM) in distinguishing between distractors and UAV targets.
In detail, we initiate the process by flattening the features and applying the Feature Division Module (FDM) in the upper branch, as illustrated in Figure 5, where the lower branch features are of dimension $\mathbb{R}^{N \times C}$ and the upper branch features are of dimension $\mathbb{R}^{N/2^{k-1} \times C}$ (with k set to 3 in the figure). Subsequently, features from both branches undergo the operation $\frac{QK^{T}}{\sqrt{d_k}}$ to generate attention maps of dimension $\mathbb{R}^{N \times N}$. Notably, the upper branch generates $2^{k-1}$ attention maps of dimension $\mathbb{R}^{N/2^{k-1} \times N/2^{k-1}}$, and a scatter operation is employed to map the corresponding features into an empty $\mathbb{R}^{N \times N}$ attention map based on the division indices. Hence, the upper branch yields a local attention map, denoted as $Cor_{loc}$, while the lower branch yields $Cor_{glo}$.
Following this, a Top-K mechanism is applied to selectively identify potentially relevant tokens, with values outside the Top-K range set to zero. Both branches collectively generate an attention map with a reduced dimension of $\mathbb{R}^{N \times K}$, where K is significantly smaller than N. This refined approach optimizes the token selection process, focusing on tokens deemed valuable for the tracking task.
$$C_{ij} = \begin{cases} \{Cor_{glo}, Cor_{loc}\}_{ij}, & \text{if } C_{ij} \in \mathrm{TopK} \\ 0, & \text{otherwise} \end{cases}$$
where $C_{ij}$ represents the selected tokens. The summed outcomes are then sent into the Softmax block to form the complete "local guiding global" attention map. Subsequently, we employ a linear operation and a permutation operation to ensure that the dimensions of the group tokens align appropriately. This alignment step prepares the prompter for utilization in the Transformer-based architecture. Finally, we generate the prompters for each division block at a low computation cost by introducing Top-K.
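A minimal sketch of this Top-K masking and local-global fusion is given below; the function names and the row-wise application of Top-K are illustrative assumptions.

```python
import torch

def topk_mask(cor, k):
    """cor: (N, N) attention scores; zero everything outside each row's Top-K."""
    idx = cor.topk(k, dim=-1).indices
    mask = torch.zeros_like(cor).scatter_(-1, idx, 1.0)
    return cor * mask

def local_guiding_global(cor_loc, cor_glo, k):
    """Mask both maps, sum them, and apply Softmax to obtain the fused attention map."""
    return torch.softmax(topk_mask(cor_loc, k) + topk_mask(cor_glo, k), dim=-1)
```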

4.2.4. Prompter Integration Strategy (PIS)

To promote enhanced interaction between the prompter and the prompted groups, we concatenate the prompters with the Key and Value components of the group tokens, effectively incorporating them into the attention mechanism of the model. This facilitates the transmission of valuable guidance to the divided groups and enhances the model’s ability to discern between distractors and UAV targets during the tracking process.
Remarkably, our PGM and FDM can be conceptualized as an innovative attention realization with reduced computational complexity. In contrast to the global attention approach, our method exclusively operates on partitioned features containing fewer tokens. The formula is shown below:
$$\mathrm{PromptAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q_d(s)\,[K_d(s); P]^{T}}{\sqrt{d_k}}\right)[V_d(s); P]$$
where $[K_d(s); P]$ and $[V_d(s); P]$ represent the key and value after the FDM, concatenated with the extracted prompter $P$, respectively (with dimension $\mathbb{R}^{N/2^{k-1} \times C}$). The incremental computational cost introduced by the prompter is judiciously managed through the utilization of Top-K. Notably, the ablation experiments demonstrate a mere 0.5 MB increase in model size compared with the baseline that solely employs Top-K in the attention map without a prompter.
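For clarity, the sketch below shows how a prompter can be concatenated to the keys and values of one divided group before standard attention; the function name and the (tokens, channels) layout are assumptions.

```python
import torch

def prompt_attention(q_d, k_d, v_d, prompter):
    """q_d, k_d, v_d: (n, C) divided group tokens; prompter: (p, C) self-generated tokens."""
    k = torch.cat([k_d, prompter], dim=0)            # [K_d(s); P]
    v = torch.cat([v_d, prompter], dim=0)            # [V_d(s); P]
    scores = q_d @ k.transpose(-2, -1) / q_d.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```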
Furthermore, it is imperative to highlight that the number of Transformer layers remains fixed at 2. This signifies that while the inference speed may be impacted by the presence of FDM and PGM, the overall tracking system attains an equilibrium between tracking accuracy and inference speed.

4.3. Double Head Layer

The double head layer follows SiamFC++ [6]. As shown in Figure 2, this module incorporates an fc-head, comprising two fully connected layers, and a conv-head, comprising L convolutional blocks. The objective function is formulated as the weighted summation of the classification loss and regression loss, delineated as follows:
$$L = \omega_{fc} \cdot \big(\lambda_{fc} L_{fc}^{class} + (1 - \lambda_{fc}) L_{fc}^{box}\big) + \omega_{conv} \cdot \big((1 - \lambda_{conv}) L_{conv}^{class} + \lambda_{conv} L_{conv}^{box}\big)$$
where $\lambda_{fc}$, $\lambda_{conv}$, $\omega_{fc}$, and $\omega_{conv}$ are hyper-parameters. For a fair comparison, we keep the same setting as SiamFC++, where L is set to 7.
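The weighted combination above reads directly as the small helper below; the default hyper-parameter values are placeholders, not the paper's settings.

```python
def double_head_loss(l_fc_cls, l_fc_box, l_conv_cls, l_conv_box,
                     lam_fc=0.5, lam_conv=0.5, w_fc=1.0, w_conv=1.0):
    """Weighted sum of the fc-head and conv-head classification/box losses."""
    fc_term = w_fc * (lam_fc * l_fc_cls + (1 - lam_fc) * l_fc_box)
    conv_term = w_conv * ((1 - lam_conv) * l_conv_cls + lam_conv * l_conv_box)
    return fc_term + conv_term
```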

5. Experimental Results and Analysis

This section first presents the implementation details, covering the training specifics (datasets and hyper-parameters), the inference process, and the evaluation metrics. A comparative analysis then shows the performance of SiamPT against other state-of-the-art methods across three benchmarks. Finally, ablation studies are conducted to analyze the impact of each component and various design choices.
Training. SiamPT is trained on four Nvidia GeForce RTX 3090 GPUs (Nvidia, Santa Clara, CA, USA) and tested on a single RTX 3090 GPU. The total number of training epochs is 20, with a training time of around 40 h, a batch size of 128, and 600,000 images per epoch sampled from COCO [40], GOT-10k [41], LaSOT [42], and TrackingNet [43], along with ILSVRC VID [44] and ILSVRC DET [44]. Notably, the training time is limited by the storage location of the data, as only the GOT-10k dataset resides on the local disk. Consequently, the training duration for a single session on the GOT-10k dataset alone extends to approximately 15 h when executed on a single Nvidia RTX 3090 GPU with 300,000 images per epoch. It is worth highlighting that the per-frame inference time is highly efficient, at merely 0.010 s, equivalent to a frame rate of 91.0 FPS on a single Nvidia RTX 3090 GPU. The optimization utilizes AdamW with a step learning strategy, adjusting the learning rate from $1 \times 10^{-4}$ to $1 \times 10^{-5}$ by the 10th epoch.
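The reported optimization setup corresponds roughly to the configuration sketched below; the model and training-loop stand-ins are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                 # stand-in for the SiamPT network

def train_one_epoch(model, optimizer):  # stand-in for one pass over the training data
    pass

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(20):                 # 20 training epochs in total
    train_one_epoch(model, optimizer)
    scheduler.step()                    # lr: 1e-4 for epochs 0-9, 1e-5 afterwards
```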
Dataset. The UAV123 dataset [45], a cornerstone of our evaluation, comprises 123 sequences of aerial remote-sensing videos captured by Unmanned Aerial Vehicles (UAVs). This dataset encompasses over 110,000 annotated image frames. Each frame is annotated with bounding boxes and labeled with 12 target attributes, providing an extensive and diverse collection of labeled data for tracking evaluation. It covers a wide array of challenges encountered in real-world UAV tracking scenarios, such as fast drone or camera movements. These challenges include Camera Motion, Aspect Ratio Change, Full Occlusion, Illumination Variation, Fast Motion, Scale Variation, Viewpoint Change, Low Resolution, Out-of-View, Background Clutter, Similar Objects, and Partial Occlusion across the 123 video sequences. In addition, the UAV123 dataset encompasses various tracking targets across different scenes, such as urban environments, roads, and water surfaces.
Inference. The inference procedure of SiamPT is intricately divided into two fundamental phases: initialization and tracking. Each phase plays a vital role in the tracking process and is defined as follows.
Initialization Phase: In the initialization phase, the tracker sets the foundation for the tracking process. The first frame of the video sequence acts as the template frame. This frame is critical for establishing the initial target representation and the associated bounding box coordinates. It is noteworthy that, in the spirit of maintaining fairness in our evaluation, we refrain from introducing any template updating mechanisms. This means that the features extracted from the template frame remain constant and unchanged throughout the entire tracking process.
Tracking Process: The tracking process is dynamic. The video sequence, represented by the images in the search branch, is treated as a stream of frames. During this phase, the bounding box predicted in each iteration is used to crop the search region of the next frame. SiamPT utilizes its architecture to robustly and efficiently track the target throughout the entire video sequence, achieving the desired balance between speed and accuracy.
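The per-frame loop described above can be summarized by the sketch below, where `predict_box` and `crop` are hypothetical stand-ins for the SiamPT forward pass and the search-region cropping.

```python
def track_sequence(frames, init_box, predict_box, crop):
    """One-pass tracking: the previous prediction defines the next search region."""
    template = crop(frames[0], init_box)      # fixed template, never updated
    box, results = init_box, [init_box]
    for frame in frames[1:]:
        search = crop(frame, box)             # crop around the previous prediction
        box = predict_box(template, search)   # estimate the new bounding box
        results.append(box)
    return results
```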
Evaluation metrics. The evaluation procedure adheres to the one-pass evaluation (OPE) protocol, a widely recognized standard in tracking evaluation. This protocol requires that the tracker not be re-initialized during the evaluation, providing a rigorous assessment of continuous tracking performance. Two primary metrics are employed: Precision and Success Rate. Precision measures the percentage of frames in which the center location error is less than a predefined pixel threshold; the default threshold is 20 pixels. The success rate assesses the tracker’s effectiveness in maintaining intersection over union (IoU) alignment with the target. It is expressed as the percentage of frames in which the overlap ratio between the predicted tracking box and the ground-truth bounding box exceeds a specified threshold; the default threshold is 0.5. The benchmarking process is facilitated by the PySOT toolkit and covers the 12 challenge attributes of the UAV123 dataset mentioned above.
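For reference, the two metrics can be computed as in the sketch below, assuming boxes are given as (x, y, w, h) arrays in pixels.

```python
import numpy as np

def precision_and_success(pred, gt, dist_thr=20.0, iou_thr=0.5):
    """pred, gt: (N, 4) arrays of (x, y, w, h) boxes for N frames."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0                 # predicted box centers
    gc = gt[:, :2] + gt[:, 2:] / 2.0                     # ground-truth centers
    precision = (np.linalg.norm(pc - gc, axis=1) < dist_thr).mean()

    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    success = (inter / np.maximum(union, 1e-12) > iou_thr).mean()
    return precision, success
```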

5.1. State-of-the-Art Comparison

In this test, our algorithm was compared with SOTA algorithms, including TrDiMP [34], TrSiam [34], ATOM [12], TransT [15], SiamAttn [14], SiamBAN [7], ECO [46], SiamSTM [24], SiamITL, ParallelTracker, etc. Our SiamPT model reaches state-of-the-art performance, and the inference speed is around 91.0 FPS on a single RTX 3090 GPU. Since our model is built upon the TransT architecture, our primary comparisons are performed with trackers possessing similar structures.

5.1.1. UAV123 Benchmark

Figure 6 depicts the overall performance on the UAV123 dataset. The left part is the success plot and the right part is the precision plot. As mentioned before, the whole process is a one-pass evaluation (OPE). Our algorithm, SiamPT, achieves a remarkable success rate of 0.694, coupled with an impressive precision score of 0.890. Compared to TransT, SiamPT achieves a success rate and precision that are 2% and 1.5% higher, respectively. This indicates that our algorithm has leading accuracy and robustness.
Table 1 provides a comparative analysis of current UAV benchmark trackers, focusing on key metrics such as Success Rate, Precision, Inference Speed, Platform, and Model Parameters. The results indicate that our SiamPT excels in both tracking performance and inference speed. Notably, SiamSTM stands out with a remarkable inference speed of 193 FPS and the smallest model size of 31.1 MB, attributed to its lightweight backbone design. Our approach, SiamPT, has a similar model size of 32.8 MB and a moderately fast inference speed of 91 FPS. However, SiamPT delivers leading performance, with a success rate of 0.694 and a precision of 0.890, outperforming SiamSTM by approximately 13% and 10%, respectively.
Furthermore, in contrast to ParallelTracker, which exhibits commendable performance with a success rate of 0.692 and a precision of 0.905, its inference speed is limited to 25 FPS, representing only approximately 25% of our SiamPT’s rapid processing capabilities. Additionally, the model size of ParallelTracker is substantially larger at 47.57 MB, surpassing our model by around 15 MB. This comparative analysis underscores that ParallelTracker, with its slower processing speed and larger model size, may not be well-suited for tasks requiring high-speed processing, such as those encountered in UAV applications. In contrast, our model, SiamPT, not only boasts superior speed but also features a more compact model size, aligning it with the stringent requirements of tasks involving UAVs. The overall assessment reinforces that SiamPT adeptly strikes a balance between tracking accuracy and swift inference, making it well-suited for high-speed processing demands in UAV-related scenarios.
Figure 7 illustrates that our tracker can effectively utilize the prompters to handle the challenging situations of camera motion, aspect ratio change, and full occlusion. Specifically, for aspect ratio change, our SiamPT outperforms the second-best algorithm by 4% in success rate and 1.6% in precision. For camera motion, SiamPT outperforms it by 2.3% in success rate and 1.8% in precision. For full occlusion, SiamPT outperforms it by 3% in success rate and 0.6% in precision.
Figure 8 shows that our tracker can successfully handle scenarios like scale variation, viewpoint change, and low resolution. For these attributes, SiamPT outperforms the second-best algorithm by 2.4%, 1.1%, and 4.9% in success rate, and by 1.7%, 0.8%, and 1.8% in precision, respectively.
Notably, among the evaluated trackers, TransT and TrDiMP demonstrate secondary performance in terms of tracking success rate and precision, with other Siamese trackers ranking behind them. For instance, TransT achieves success rates of 0.667, 0.706, and 0.542 on attributes such as scale variation, viewpoint change, and low resolution. Similarly, TrDiMP attains success rates of 0.657, 0.689, and 0.539 for these attributes. The primary reason behind this performance difference lies in TransT’s utilization of the original Transformer structure, and TrDiMP’s incorporation of a temporal algorithm enabling adaptive learning of environmental variations. However, both TransT and TrDiMP fall short in further enhancing information interaction compared to our approach.
Figure 9 shows that our tracker does not achieve leading performance in scenarios such as illumination variation, fast motion, and out-of-view. This is reasonable, as the prompters are generated within the limits of local attention, which may not benefit the global perspective.
Parameters analysis. Table 1 presents the parameters of our proposed SiamPT, which total 31.83 M, where the CNN backbone ConvNeXt-V1, Transformer neck, and tracking head account for 12.33 M, 6.8 M, and 12.8 M parameters, respectively. In contrast to the comparative algorithms, our proposed model not only has fewer model parameters but also attains superior performance. This result indicates that our approach is better suited for deployment on edge devices, particularly in the context of high-precision UAV scene tracking missions.

5.1.2. UAV20L Benchmark

The UAV20L benchmark consists of 20 lengthy sequences with nearly 3000 frames per sequence on average and is designed as a long-term tracking benchmark that poses more complex real-world UAV tracking scenarios. In our experiments on the UAV20L benchmark, we compare our algorithm, SiamPT, with several other state-of-the-art trackers, including SiamITL, SiamFC++, SiamBAN, SiamAPN++ [47], SiamCAR, and SESiamFC (following the same comparison protocol as SiamITL). As illustrated in Table 2, it is evident that our algorithm stands out with the highest performance in terms of both success rate and accuracy in the challenging long-term tracking scenario.
In the long-term tracking scenario, challenges like background interference can readily induce tracking drift, and once the target is lost, reacquiring it becomes a formidable task. Consequently, a tracker must possess a robust capability to discern the target amidst intricate backgrounds. In contrast to comparative algorithms, our SiamPT introduces the FDM and PGM to explicitly enhance the anti-jamming proficiency of trackers, culminating in optimal performance within long-term tracking scenarios.

5.1.3. GOT-10k Benchmark

GOT-10k [41] stands as a comprehensive benchmark dataset designed for tracking a wide range of objects. This extensive dataset comprises over 10,000 video sequences, featuring 1.5 million annotated bounding boxes. In our evaluation, the UAV datasets serve as the specialized setting, while GOT-10k serves as the general setting. This enables us to demonstrate the generalization capabilities of our method across a wide array of tracking scenarios. We employ the Average Overlap (AO) and Success Rate (SR) metrics for this assessment. The AO metric measures the degree of overlap between the predicted bounding boxes and the ground-truth annotations. The SR metric quantifies the percentage of frames where the overlap surpasses specified thresholds. The training process is conducted only on the GOT-10k dataset. As demonstrated in Table 3, our algorithm exhibits exceptional performance, achieving the highest AO and SR. In particular, for AO, we surpass the baseline TransT by 8%.

5.1.4. RPN Benchmarks

As demonstrated in Table 4, our tracker achieves leading performance compared with traditional RPN-based trackers. Specifically, our tracker outperforms SiamRPN++ and SiamRPN by 13.8% and 24.6% in success rate, respectively. These experimental results are reasonable: Siamese trackers, by default, do not employ additional measures to distinguish between target objects and background interference, which is an essential aspect of object tracking. In our proposed methodology, the integration of the FDM and the PGM explicitly addresses this critical challenge.

5.2. Visualization

As illustrated in Figure 10, SiamPT shows excellent performance on the UAV video sequences. Specifically, concerning the small targets in the first and second rows, the PGM module dynamically extracts valuable information from global contexts, enhancing SiamPT’s ability to discern targets from interference. This adaptability enables SiamPT to effectively handle small-scale variations. Furthermore, when confronted with the fast-moving targets depicted in the third and fourth rows, SiamPT excels in tracking their rapid motion. The FDM module plays a pivotal role in distinguishing local features, thereby filtering out background interference. This mechanism ensures SiamPT’s robust performance, demonstrating its prowess in tracking dynamic scenes with agility. Overall, the embedded integration of the PGM and FDM modules in SiamPT enables it to address various challenges in UAV tracking scenarios.

5.3. Limitations and Future Work

While our proposed SiamPT has improved tracking performance, it does have certain limitations. The most notable issue is the absence of exploration into template updating techniques. In future work, our emphasis will be on introducing the template mechanism or exploring various prompting realization methods. Additionally, incorporating a multi-scale attention mechanism may prove beneficial for tracking small targets, such as UAVs.

5.4. Ablation Study

In Table 5, our two proposed components, the PGM and the FDM, impart substantial enhancements in contrast to the baseline. The baseline is a lightweight modification of TransT. Incorporating the PGM into the baseline yields an improvement of 0.4%. Applying the FDM alone to tracking tasks, however, causes a slight performance decrease. By combining the PGM and FDM, the overall performance shows a remarkable increase. This is reasonable because the FDM alone may fall into wrong dividing groups, and its attention-based ranking strategies introduce a large computation cost. By combining them, the prompters can be used to enhance the discrimination ability within each group and finally improve the tracking performance.

6. Conclusions

In this paper, we propose a fast and efficient tracking pipeline tailored for Unmanned Aerial Vehicle (UAV) videos. Our approach pivots on the concept of self-prompting, designed to strike a balance between tracking accuracy and inference speed without the demand for additional multi-modal inputs. A distinctive aspect of our strategy is its adept utilization of tokens within the search region, harnessing them to autonomously generate prompters. This search region is thoughtfully segmented into several groups, each prompted by our self-generated prompter. To the best of our knowledge, our work is the first attempt to generate the prompter in an adaptive way from the features themselves. Our algorithm can also be regarded as a new approach to combining local and global attention. Notably, our prompter generation module offers versatility across various vision applications, extending its utility to other vision tasks. For example, current tasks with attention mechanisms can use our prompter generation to enhance their performance without much additional computation cost. Our model achieves state-of-the-art performance and inference speed on UAV tracking benchmarks, proving its effectiveness.

Author Contributions

Conceptualization, Z.W.; methodology, Z.W. and J.Y.; software, Z.W. and G.Z.; validation, Z.W.; formal analysis, Z.W.; investigation, Z.W.; resources, G.Z. and J.Y.; data curation, G.Z. and J.Y.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W. and G.Z.; visualization, Z.W.; supervision, J.Z.; project administration, Q.B. and Q.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are available from the corresponding author upon request. The data are not publicly available due to further research in the future.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Choi, J.; Yeum, C.M.; Dyke, S.J.; Jahanshahi, M.R. Computer-aided approach for rapid post-event visual evaluation of a building façade. Sensors 2018, 18, 3017. [Google Scholar] [CrossRef] [PubMed]
  2. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar] [CrossRef]
  3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  4. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
  5. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
  6. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556. [Google Scholar]
  7. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677. [Google Scholar]
  8. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 771–787. [Google Scholar]
  9. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277. [Google Scholar]
  10. Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318. [Google Scholar]
  11. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
  12. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669. [Google Scholar]
  13. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
  14. Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6728–6737. [Google Scholar]
  15. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  16. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
  17. Wang, Z.; Yao, J.; Tang, C.; Zhang, J.; Bao, Q.; Peng, Z. Information-diffused graph tracking with linear complexity. Pattern Recognit. 2023, 143, 109809. [Google Scholar] [CrossRef]
  18. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 146–164. [Google Scholar]
  19. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  20. Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9516–9526. [Google Scholar]
  21. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 558–567. [Google Scholar]
  22. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  23. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
  24. Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight Aware Enhancement Transformer and Multiple Matching Network for Real-Time UAV Tracking. Remote Sens. 2023, 15, 2857. [Google Scholar] [CrossRef]
  25. Li, S.; Fu, C.; Lu, K.; Zuo, H.; Li, Y.; Feng, C. Boosting UAV tracking with voxel-based trajectory-aware pre-training. IEEE Robot. Autom. Lett. 2023, 8, 1133–1140. [Google Scholar] [CrossRef]
26. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. STMTrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783.
27. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
28. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552.
29. Danelljan, M.; Bhat, G.; Gladh, S.; Khan, F.S.; Felsberg, M. Deep motion and appearance cues for visual tracking. Pattern Recognit. Lett. 2019, 124, 74–81.
30. Danelljan, M.; Gool, L.V.; Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7183–7192.
31. Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 784–799.
32. Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13444–13454.
33. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
34. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580.
35. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8731–8740.
36. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357.
37. Fu, Z.; Fu, Z.; Liu, Q.; Cai, W.; Wang, Y. SparseTT: Visual tracking with sparse transformers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 905–912.
38. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
39. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
41. Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
42. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383.
43. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317.
44. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
45. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461.
46. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646.
47. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3086–3092.
Figure 1. Comparison of different tracking pipelines of Transformer-based trackers. (a) is the conventional tracking pipeline. (b) is the novel prompting tracking pipeline (CVPR2023). (c) is our proposed tracking pipeline with a self-prompting mechanism.
Figure 2. Overview of SiamPT with the CNN backbone, Transformer neck, and tracker head stages. An efficient ConvNeXt backbone forms the foundation of our tracker. Within the neck block, the Prompter Generation Module (PGM) extracts prompters from the global attention mechanism, while the Feature Division Module (FDM) categorizes tokens into different classes, ensuring the ability to distinguish targets from background interference.
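To make the three-stage layout in Figure 2 concrete, the following minimal PyTorch-style sketch shows how a backbone, Transformer neck, and tracker head of this kind can be wired together. The module choices, tensor shapes, and layer counts are illustrative assumptions only, not the released SiamPT implementation: a plain convolution stands in for the ConvNeXt backbone, and a generic Transformer encoder stands in for the neck that hosts the PGM and FDM.

import torch
import torch.nn as nn

class SiamPTStyleTracker(nn.Module):
    """Illustrative backbone -> Transformer neck -> tracker head layout (shapes assumed)."""
    def __init__(self, dim=256, num_layers=4):
        super().__init__()
        # Stand-in for the ConvNeXt backbone: any CNN returning a (B, dim, H, W) feature map works here.
        self.backbone = nn.Sequential(nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.GELU())
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Stand-in for the neck that hosts the PGM and FDM in the paper.
        self.neck = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_head = nn.Linear(dim, 1)  # foreground score per search token
        self.reg_head = nn.Linear(dim, 4)  # box offsets per search token

    def forward(self, template, search):
        z = self.backbone(template).flatten(2).transpose(1, 2)  # (B, Nz, dim) template tokens
        x = self.backbone(search).flatten(2).transpose(1, 2)    # (B, Nx, dim) search tokens
        tokens = self.neck(torch.cat([z, x], dim=1))            # joint attention over both token sets
        x_out = tokens[:, z.shape[1]:]                          # keep the search tokens for prediction
        return self.cls_head(x_out), self.reg_head(x_out)

# Example usage with assumed crop sizes of 128 (template) and 256 (search):
tracker = SiamPTStyleTracker()
scores, boxes = tracker(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 256, 256))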
Figure 3. Overview of the Transformer neck, including the details of our proposed FDM and PGM. The different colors denote different clustering regions in the feature map.
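As a rough illustration of the token grouping described for the FDM, the snippet below clusters token embeddings with a toy k-means loop so that target-like and background-like tokens fall into separate groups. This is our own simplified stand-in for exposition; the group count, initialization, and iteration budget are arbitrary and do not reflect the actual FDM design.

import torch

def divide_tokens(tokens, num_groups=3, iters=5):
    """Toy k-means over token embeddings: split a (B, N, D) token set into a few clusters."""
    B, N, D = tokens.shape
    centers = tokens[:, torch.randperm(N)[:num_groups]].clone()   # random initial centers, (B, G, D)
    for _ in range(iters):
        dist = torch.cdist(tokens, centers)                       # (B, N, G) pairwise distances
        assign = dist.argmin(dim=-1)                               # nearest-center index per token
        for g in range(num_groups):
            mask = (assign == g).unsqueeze(-1).float()             # (B, N, 1) membership mask
            centers[:, g] = (tokens * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
    # Tokens sharing a cluster index can then be treated as one region (e.g., target vs. background).
    return assign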
Figure 4. Overview of the attention structure in SiamPT, which involves Multi-Head Attention (MHA), a Feed-Forward Network (FFN), the Feature Division Module (FDM), and the Prompter Generation Module (PGM).
Figure 5. The process of generating the prompter: local attention and global attention guide each other to form the prompter. As above, the different colors denote different clustering regions in the feature map.
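One simple way to realize the self-prompting step described in Figure 5 is sketched below under our own assumptions: the search tokens that the template attends to most strongly in the global view are reused as prompt tokens for the subsequent local attention. The top-k selection rule and the tensor names are illustrative, not the published definition of the PGM.

import torch

def generate_prompter(template_tokens, search_tokens, k=16):
    """Select the k search tokens most attended by the template as prompter tokens (illustration)."""
    d = template_tokens.shape[-1]
    # Global attention: every template token attends to every search token.
    attn = torch.softmax(template_tokens @ search_tokens.transpose(1, 2) / d ** 0.5, dim=-1)
    scores = attn.mean(dim=1)                          # (B, Nx) average attention received per search token
    idx = scores.topk(k, dim=-1).indices               # indices of the most salient search tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, d)
    prompter = torch.gather(search_tokens, 1, idx)     # (B, k, d) prompter tokens
    return prompter                                    # later fed back into the local attention stage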
Figure 6. Overall performance comparison on the UAV123 dataset, reporting the success rate and precision of the proposed SiamPT.
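For reference, the success rate and precision used in Figure 6 and the following tables are the standard one-pass-evaluation metrics: success is the area under the overlap-threshold curve of per-frame IoU, and precision is the fraction of frames whose predicted center lies within 20 pixels of the ground truth. A small NumPy sketch of these definitions follows, assuming boxes in (x, y, w, h) format.

import numpy as np

def iou(pred, gt):
    """Per-frame IoU for boxes given as (x, y, w, h) arrays of shape (N, 4)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def success_score(pred, gt):
    """Area under the success plot: mean fraction of frames with IoU above each threshold in [0, 1]."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0, 1, 21)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))

def precision_score(pred, gt, radius=20.0):
    """Fraction of frames whose predicted box center is within `radius` pixels of the ground-truth center."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return float((np.linalg.norm(cp - cg, axis=1) <= radius).mean())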
Figure 7. Comparison on the UAV123 dataset in different attributes. Camera Motion, Aspect Ratio Change, and Full Occlusion are included.
Figure 8. Comparison on the UAV123 dataset in different attributes. Scale Variation, Viewpoint Change, and Low Resolution are included.
Figure 9. Comparison on the UAV123 dataset in different attributes. Illumination Variation, Fast Motion, and Out-of-View are included. Note that our model does not rank first on these attributes.
Figure 10. Visualization results on the UAV123 dataset.
Table 1. Overall performance compared with current UAV-benchmark trackers, including inference speed, platform, and parameters.

                         SiamSTM    SiamITL    ParallelTracker    Ours
Success Rate             0.618      0.625      0.692              0.694
Precision                0.809      0.818      0.905              0.890
Inference Speed (FPS)    193        32         25                 91
Platform (GPU)           RTX3090    RTX3090    RTX2070            RTX3090
Parameters (MB)          31.1       65.4       47.6               32.8
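The speed and size figures in Table 1 depend on the measurement protocol, so the snippet below shows one typical way such numbers are obtained for a two-input (template, search) tracker in PyTorch: weight storage in megabytes and average frames per second after a GPU warm-up. The input resolutions, warm-up length, and run count are assumptions for illustration, not the exact protocol used for the table.

import time
import torch

def model_size_mb(model):
    """Weight storage in megabytes (one reading of the 'Parameters (MB)' column)."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 ** 2)

@torch.no_grad()
def measure_fps(model, device="cuda", runs=100, warmup=10):
    """Average single-pair inference speed in frames per second."""
    model.eval().to(device)
    template = torch.randn(1, 3, 128, 128, device=device)   # template crop size assumed
    search = torch.randn(1, 3, 256, 256, device=device)     # search region size assumed
    for _ in range(warmup):                                  # warm-up passes before timing
        model(template, search)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(template, search)
    torch.cuda.synchronize()
    return runs / (time.time() - start)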
Table 2. Overall performance on the UAV20L benchmark.

             SiamITL    SiamFC++    SiamBAN    SiamAPN++    SiamCAR    SESiamFC    Ours
Success      0.588      0.575       0.564      0.533        0.523      0.453       0.653
Precision    0.769      0.742       0.736      0.703        0.687      0.648       0.848
Table 3. Overall performance on the GOT-10k benchmark.

            SparseTT    TransT    DTT      TrDiMP    DiMP     SiamR-CNN    Ours
AO          0.693       0.671     0.634    0.671     0.611    0.649        0.725
SR0.5       0.791       0.768     0.749    0.777     0.717    0.738        0.827
SR0.75      0.638       0.609     0.514    0.583     0.492    0.597        0.670
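Table 3 uses the GOT-10k metrics: AO is the average per-frame overlap, while SR0.5 and SR0.75 are the fractions of frames whose overlap exceeds 0.5 and 0.75, respectively. The short sketch below computes them from a list of per-frame IoUs, averaging over all frames for brevity (the official toolkit averages per sequence first).

import numpy as np

def got10k_scores(overlaps):
    """AO, SR@0.5, and SR@0.75 from per-frame IoU values."""
    overlaps = np.asarray(overlaps, dtype=float)
    ao = overlaps.mean()               # average overlap
    sr_05 = (overlaps > 0.5).mean()    # success rate at IoU > 0.5
    sr_075 = (overlaps > 0.75).mean()  # success rate at IoU > 0.75
    return ao, sr_05, sr_075

# Example: per-frame IoUs of 0.8, 0.6, and 0.4 give AO = 0.60, SR0.5 = 0.67, SR0.75 = 0.33.
print(got10k_scores([0.8, 0.6, 0.4]))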
Table 4. Overall performance compared with current RPN-based trackers.

             SiamRPN    SiamRPN++    SiamBAN    SiamCAR    SiamSTM    Ours
Success      0.557      0.610        0.631      0.614      0.647      0.694
Precision    0.710      0.752        0.833      0.760      -          0.890
Table 5. Ablation study conducted on the UAV123 benchmark. The ✓ and × denote whether our network includes the corresponding component.

No.    PGM    FDM    Overall (SR)    Inference Speed (FPS)    Model Size in Neck (MB)
1      ×      ×      0.680           122.1                    6.3
2      ✓      ×      0.677           98.2                     -
3      ×      ✓      0.684           118.0                    -
4      ✓      ✓      0.694           91.0                     6.8