1. Introduction
Unmanned aerial vehicles (UAVs) have demonstrated substantial potential and hold promising applications in fields such as surveillance, search and rescue, agriculture, and national defense [1,2,3], attracting considerable attention from both academia and industry. UAV formations, in particular, have become a research hotspot. Current approaches to UAV formation mostly rely on cooperative frameworks, in which information exchange between UAVs facilitates coordinated flight and collaborative tasks [4,5,6]. However, in complex environments with strong interference, collaborative means such as radio communication are susceptible to disruption, making cooperation unreliable. Furthermore, as UAVs have become more accessible in recent years, the probability of their misuse has grown, heightening security concerns over malicious flights [7,8,9]. At times it is necessary to track or apprehend intruding UAVs under non-cooperative conditions [10,11]. These non-cooperative scenarios require UAVs to autonomously perceive and analyze information from the environment and generate formation flight control strategies without external assistance. As a typical formation pattern, the leader-follower formation is the focus of our research.
Research on leader-follower formation control for UAVs in cooperative environments has made significant strides. Given that the flight state of the leader UAV is known, many studies have focused on the theoretical aspects of mathematical modeling for the system. A variety of control methods have been proposed, including PID control [12], backstepping control [13], sliding mode control [14], and hybrid control that integrates neural networks [15], all aimed at achieving stable formation control in both disturbed and undisturbed conditions. Additionally, a number of studies have exploited the collaborative relationships among UAVs, employing techniques such as advanced radio [16] and special markers [17] to gather extra information for formation control. However, these methods become impractical in non-cooperative scenarios, as they rely on conditions that are not met in such contexts.
Visual perception is a crucial means for UAVs to acquire information. By capturing images of targets with the onboard camera, UAVs can autonomously navigate and accomplish specific tasks. For stationary and moving targets on the ground, a considerable body of work exists; for example, [18,19,20,21] have extensively explored UAVs tracking ground targets and landing on moving targets on land or water. However, research on aerial targets remains limited due to the increased complexity of their motion: aerial targets have more degrees of freedom than ground targets, which makes tracking more difficult. Existing research predominantly focuses on quadrotors as aerial targets, whose maneuverability is relatively constrained, with only limited exploration of other aircraft types. Additionally, existing methods typically treat visual perception and UAV flight control as distinct processes, allowing researchers to separately take advantage of advancements in both visual target detection and UAV control. For instance, Feng Lin et al. developed a vision-based leader-follower UAV formation flying system [22]. Their system used a known camera model and geometric information about the leader to calculate the relative distance and direction of the leader with respect to the follower. Under a quasi-steady-state assumption, it computed the velocity and acceleration of the leader, guiding the follower UAV in flight. Xuancen Liu et al. employed the Kernelized Correlation Filters (KCF) algorithm to detect and track the target in real time [23]. They equipped a rotary-wing UAV with a three-axis gimbaled camera precisely directed at the target, thereby accomplishing follow-flight through a proportional-guidance tracking strategy. In particular, with the rapid advancement of target detection based on deep neural networks (DNNs), aerial targets can now be identified from images more effectively. Donghee Lee et al. developed a visual tracker comprising an adaptive search region (SR) and a fully convolutional neural network (FCNN) to obtain the precise location of a target UAV in images [24], and successfully controlled a micro UAV to follow a target UAV using this tracker. Kyubin Kim et al. also proposed a UAV tracking system consisting of a UAV tracker and a control signal generator [25]; the tracker used the YOLOv3 object detection network for target identification. Ye Zheng et al. [26] introduced the Det-Fly dataset and evaluated the performance of eight representative deep-learning algorithms for air-to-air drone detection. Building on this, Jianan Li et al. [27] proposed a new pseudo-linear Kalman filter and a novel 3-D helical guidance law to enable a quadrotor UAV to pursue another. However, in these two-stage methods, the visual perception and UAV control algorithms operate in isolation, and the neural networks play no part in controlling the UAV's flight. As a result, the UAV control algorithms receive only basic image information from the outputs of the visual perception algorithms, such as the target's position and size within the image, missing the opportunity to exploit the semantic information extracted by the neural networks. Fabian Schilling et al. [28] proposed an entirely visual approach for coordinating markerless UAV swarms based on the DAgger imitation-learning algorithm. Their method is one-stage but requires UAVs to be equipped with up to six cameras to provide omnidirectional vision, which makes it difficult to apply to most existing UAVs.
In this context, we employ a common monocular camera mounted on a quadrotor UAV as the image acquisition sensor and propose a learning-based visual chasing flight control method for UAVs in non-cooperative scenarios, eliminating the need for radio communication or other external assistance. Taking a highly maneuverable fixed-wing UAV as the aerial target, our approach allows the quadrotor UAV to follow its flight, establishing a leader-follower formation. The main contributions of our work are outlined below:
We develop a novel end-to-end deep neural network model called Vision Follow Net (VFNet) that integrates multi-source data, including images captured by the onboard monocular camera and the flight state of the quadrotor UAV. By employing a multi-head self-attention mechanism, VFNet aggregates information over a temporal window to predict the waypoints needed for the quadrotor’s chasing flight. Additionally, by calculating the line-of-sight (LOS) angle to the target, our method controls the yaw angle of the quadrotor, ensuring that the target remains within the field of view of the onboard camera during formation flying.
We implement a simulation flight system with a fixed-wing UAV and a quadrotor UAV, forming a leader-follower formation. By conducting flight simulations, we validate the effectiveness of the proposed method. The experimental results demonstrate that our method allows UAVs to execute sharp maneuvers and achieve stable chasing flights over long periods and distances. Additionally, ablation studies show that the neural network, through the extraction and learning of deep features from target images, markedly improves the performance of waypoint predictions.
3. Proposed Method
In this section, we introduce our learning-based UAV chasing control method, including target detection, the angular geometry calculator, and the VFNet.
3.1. Workflow
Considering the underactuated nature of the quadrotor UAV [23], we only need to directly track the desired yaw angles and waypoints to control its flight; the flight controller then determines the corresponding roll and pitch angles from these known quantities. Our method uses images captured by the onboard camera and the quadrotor UAV's flight state data to derive these control inputs.
Figure 3 illustrates the entire workflow of our proposed method. At any given moment, the monocular camera of the quadrotor UAV captures an image containing the aerial target, and the flight controller measures the velocity and the three attitude angles (pitch, yaw, and roll) in the navigation coordinate system, reflecting the current flight state. The image is processed by a target-detection algorithm, which provides the position and size of the target in the image coordinate system. Because this information defines a bounding box around the target, it is referred to as bounding box data. To extract semantic features, a region containing the target is cropped from the original image, referred to as the clipped image; it has a fixed side length of 224 pixels. From the clipped image, together with the bounding box and flight state data, the VFNet infers the waypoint of the quadrotor at the next moment. Simultaneously, the angular geometry calculator uses the bounding box data and flight attitude angles to compute the desired yaw angle for the UAV. Finally, the waypoint and the desired yaw angle are transmitted to the flight controller, which employs a cascaded control architecture combining Proportional (P) and Proportional-Integral-Derivative (PID) controllers, ultimately governing the flight of the quadrotor UAV [29]. A detailed explanation is provided in the following paragraphs.
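To make the workflow concrete, the following minimal Python sketch outlines one control cycle. The object and function names (detector, vfnet, angle_calc, controller, clip_region) are illustrative placeholders, not the exact interfaces of our implementation.

```python
import numpy as np

def clip_region(image, bbox, side=224):
    """Crop a fixed-size square around the target centre (clamped to the frame)."""
    x, y, _, _ = bbox
    h, w = image.shape[:2]
    x0 = int(np.clip(x - side // 2, 0, w - side))
    y0 = int(np.clip(y - side // 2, 0, h - side))
    return image[y0:y0 + side, x0:x0 + side]

def control_step(image, flight_state, detector, vfnet, angle_calc, controller):
    """One control cycle of the workflow in Figure 3 (names are illustrative)."""
    bbox = detector(image)                                  # (x, y, w, h) in pixels
    clipped = clip_region(image, bbox)                      # 224 x 224 patch around the target
    waypoint = vfnet.predict(clipped, bbox, flight_state)   # next waypoint from VFNet
    yaw_sp = angle_calc.desired_yaw(bbox, flight_state)     # desired yaw from the geometry calculator
    controller.send_setpoint(waypoint, yaw_sp)              # cascaded P/PID flight controller
```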
3.2. Obtaining Target from the Image
Target detection is a fundamental task in computer vision, and a variety of powerful algorithms can effectively address this task, especially with the help of deep learning. Since detecting UAV targets in images is not the primary focus of our method, we have adopted a common bounding box data format to convey the outcomes of target detection. This format comprises a four-dimensional vector (x, y, w, h), where x and y denote the pixel coordinates of the target's center in the image coordinate system, and w and h represent the pixel width and height of the target, respectively. Any target-detection algorithm that produces output in this format is compatible with our method.
To simplify our work, we implemented an algorithm based on OpenCV to acquire the UAV targets from images. When the fixed-wing UAV is flying in a desolate scene, the target exhibits a noticeable contrast with the background, enabling effective separation through image binarization. As shown in
Figure 4, the process begins with the conversion of the RGB raw image to grayscale. Next, upper and lower thresholds are set for image binarization. Finally, the bounding box data can be obtained by extracting target contours, filtering based on area, and fitting bounding boxes.
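A minimal OpenCV sketch of this pipeline is given below, under the stated assumption of a high-contrast target against a desolate background; the threshold and minimum-area values are illustrative and must be tuned per scene.

```python
import cv2
import numpy as np

def detect_target_bbox(image_bgr, lower=0, upper=90, min_area=20):
    """Binarisation-based detection sketch: returns (x, y, w, h), i.e. the target
    centre and size in pixels, or None if no plausible contour is found.
    Threshold and minimum-area values are placeholders, not tuned constants."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Keep pixels whose intensity lies between the lower and upper thresholds.
    mask = ((gray >= lower) & (gray <= upper)).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Discard tiny contours (noise), then keep the largest remaining candidate.
    candidates = [c for c in contours if cv2.contourArea(c) >= min_area]
    if not candidates:
        return None
    x, y, w, h = cv2.boundingRect(max(candidates, key=cv2.contourArea))
    return (x + w / 2.0, y + h / 2.0, w, h)  # centre x, centre y, width, height
```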
3.3. The Angular Geometry Calculator
After acquiring the target’s pixel coordinates in the image, we use the monocular camera model depicted in
Figure 2 to calculate the LOS angle. By incorporating the known camera mounting angles and the flight attitude of the quadrotor, the relative orientation of the fixed-wing UAV with respect to the quadrotor UAV can be determined. This allows us to obtain the desired yaw angle for the quadrotor. All computations take place within the angular geometry calculator, with the primary steps outlined as follows:
3.3.1. Calculate LOS Angle
Let the coordinates of the fixed-wing UAV target T in the camera frame C be represented as P^C = (X_C, Y_C, Z_C), and let its corresponding position in the image frame I be denoted as (u, v). According to the monocular camera model, we have:
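A standard pinhole-projection form consistent with the definitions below, with the symbol choices (focal length f, pixel sizes d_x and d_y, principal point (u_0, v_0)) assumed rather than taken from the original equation, is:

```latex
% Standard pinhole projection; symbol choices are assumed, not the paper's originals.
Z_C \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = \begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix}
  = \mathbf{K}\,\mathbf{P}^{C},
\qquad
\mathbf{l}^{C} = \frac{\mathbf{P}^{C}}{\lVert \mathbf{P}^{C} \rVert}.
```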
Here, d_x and d_y denote the physical dimensions of a single pixel on the image sensor in the x and y directions, respectively, and (u_0, v_0) are the pixel coordinates of the principal point, where the lens axis intersects the image plane. The camera's intrinsic matrix K can be acquired using Zhang's camera calibration method [30]. In Equation (1), the unknown depth scalar Z_C does not affect the direction of P^C. Therefore, the LOS vector l^C in the camera frame is the unit vector of P^C.
3.3.2. The Azimuth Between Target and Follower
Let the LOS vector in the camera body coordinate system M be denoted as l^M. According to the coordinate axis definitions, l^M is related to l^C by a fixed rotation between the camera frame and the camera body frame. Given the camera's fixed mounting orientation on the quadrotor UAV, described by three fixed installation angles, we can derive the rotation matrix R_M^B that transforms from the camera body frame M to the UAV body frame B. Similarly, using the aircraft attitude angles (pitch, yaw, and roll), we can obtain the transformation matrix R_B^N of the frame B with respect to the navigation frame N. Therefore, the vector l^N representing the azimuth of the target T relative to the quadrotor UAV is the LOS vector expressed in the navigation frame. To align the quadrotor UAV with the target, the anticipated yaw angle ψ_d is taken as the azimuth of l^N in the horizontal plane.
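These steps can be summarized compactly as follows, under standard rotation conventions; the fixed camera-to-camera-body rotation R_C^M and the atan2-based yaw extraction are assumptions consistent with the description above, not the paper's original equations.

```latex
% Assumed summary of the transform chain and yaw extraction.
\mathbf{l}^{M} = \mathbf{R}_{C}^{M}\,\mathbf{l}^{C}, \qquad
\mathbf{l}^{N} = \mathbf{R}_{B}^{N}\,\mathbf{R}_{M}^{B}\,\mathbf{l}^{M}, \qquad
\psi_{d} = \operatorname{atan2}\!\left(l^{N}_{y},\; l^{N}_{x}\right).
```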
3.4. Vision Follow Net
VFNet is tasked with predicting waypoints for the quadrotor during the chasing flight. It extracts features from three types of information: the flight state data of the quadrotor UAV, the bounding box data, and the clipped images. These features are then integrated to determine the offset of the next moment's waypoint relative to the quadrotor's current position.
3.4.1. Data Preprocessing
Data from different sources exhibit varying numerical features. Directly taking them as inputs for the VFNet may lead to difficulties in model training and performance degradation. Therefore, it is necessary to preprocess the data through transformation.
Bounding box data: We normalize the pixel coordinates of the target center (x) and pixel width (w) by dividing them by the image width. Similarly, the pixel coordinates of the target center (y) and pixel height (h) are normalized by dividing them by the image height.
Clipped images: Each channel of the image undergoes a process of subtracting its mean and dividing by its standard deviation, resulting in normalized image data with a mean of 0 and a standard deviation of 1 for each channel.
Flight state data: These consist of the quadrotor's velocity and attitude angles. The velocity and attitude angles from the flight controller are represented in the navigation frame; we transform them into the UAV body frame and normalize them, as outlined in Equation (5). Here, the flight velocity is rotated into the UAV body frame by the rotation matrix R_N^B, which is computed from the pitch, yaw, and roll angles, and is represented as a unit direction vector. V_max denotes the known maximum flight speed of the quadrotor, and the current flight speed divided by V_max gives the normalized speed.
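The following Python sketch illustrates these three preprocessing steps. It assumes the 1920 × 1080 camera resolution used in our experiments and a navigation-to-body rotation matrix supplied by the caller; the flight-state part follows the textual description of Equation (5) rather than reproducing the original formula.

```python
import numpy as np

def preprocess(bbox, clipped_rgb, v_nav, R_nav_to_body, v_max):
    """Illustrative preprocessing of the three VFNet inputs (Section 3.4.1)."""
    # Bounding box: normalise x, w by the image width and y, h by the image height.
    x, y, w, h = bbox
    img_w, img_h = 1920.0, 1080.0
    bbox_norm = np.array([x / img_w, y / img_h, w / img_w, h / img_h])

    # Clipped image: per-channel zero mean and unit standard deviation.
    img = clipped_rgb.astype(np.float32)
    img = (img - img.mean(axis=(0, 1))) / (img.std(axis=(0, 1)) + 1e-8)

    # Flight state: rotate the velocity into the body frame, then split it into a
    # unit direction vector and a speed normalised by the maximum flight speed.
    v_body = R_nav_to_body @ np.asarray(v_nav, dtype=np.float32)
    speed = np.linalg.norm(v_body)
    v_dir = v_body / speed if speed > 0 else np.zeros(3)
    state_norm = np.concatenate([v_dir, [speed / v_max]])
    return bbox_norm, img, state_norm
```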
3.4.2. Feature Extraction and Aggregation
As illustrated in
Figure 5, the VFNet architecture is primarily divided into two components: the feature extraction and aggregation module, and the self-attention-based waypoint prediction module.
In this module, features are first extracted from the three types of input information mentioned above.
Embedding by linear layers: Flight state data and bounding box data are processed through linear layers with ReLU as the activation function, followed by layer normalization, resulting in 16-dimensional embeddings for each. Through a concatenation operation, these two embeddings are combined into a 32-dimensional vector, which encodes the follower UAV’s “personalized” flight information as well as its observational perspective on the fixed-wing UAV.
Embedding by ResNet: The clipped images are processed through an 18-layer deep residual network (ResNet) [31], producing a 32-dimensional embedding containing deep semantic information. This embedding primarily reflects the "objective" flight state of the fixed-wing UAV as observed by the follower.
Then, the features are aggregated. Inspired by the learning hidden unit contributions (LHUC) algorithm proposed in the field of speech recognition [32] and its application in recommendation systems [33], we introduce it to aggregate the "personalized" and "objective" features. As shown in Figure 6, the follower's "personalized" features are fed into a Gate Neural Unit composed of two layers. The first layer is a linear layer with ReLU activation, containing trainable weights W and biases b, and performs intra-feature interactions to better capture the complex influences among the personalized features. The output of the first layer is processed by trainable weights W' and biases b', then passed through a sigmoid function in the second layer to produce a 32-dimensional gate vector g, whose scale is controlled by a hyperparameter c so that its elements range within [0, c]. We set c to 2, allowing the gate vector to amplify or attenuate the controlled information. Finally, g is combined with the image features using a Hadamard product, an operation that takes two vectors of the same dimensions and produces a new vector in which each element is the product of the corresponding elements of the input vectors. The final output is a personalized representation of the fixed-wing UAV's flight state, referred to as observational features.
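A minimal PyTorch sketch of this gate unit is shown below, assuming 32-dimensional features and a scale of 2 as described above; the module and variable names are illustrative, not taken from our implementation.

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """LHUC-style gate: 'personalized' follower features modulate the image features."""
    def __init__(self, dim=32, scale=2.0):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Linear(dim, dim),   # first layer (W, b): intra-feature interactions
            nn.ReLU(),
            nn.Linear(dim, dim),   # second layer (W', b'), followed by a sigmoid
        )

    def forward(self, personal_feat, image_feat):
        # Gate vector in [0, scale]; scale = 2 lets it amplify or attenuate.
        gate = self.scale * torch.sigmoid(self.net(personal_feat))
        # Hadamard (element-wise) product with the 32-d image embedding.
        return gate * image_feat
```

The output of this unit at each time step serves as the observational features fed to the waypoint prediction module described next.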
3.4.3. Self-Attention-Based Waypoint Prediction
The second module is responsible for predicting waypoints using the observational features. As the flight state of a UAV does not undergo abrupt changes over short durations, the information acquired at the present moment is crucial for predicting the next waypoint, but information accumulated over previous moments can also provide valuable insights. To this end, we apply a multi-head self-attention unit [34]. This mechanism allows the neural network to focus on different parts of the input sequence. Building upon the basic self-attention framework, it introduces multiple attention heads that compute attention weights in parallel, capturing diverse perspectives of the data. Each head highlights different aspects, and by integrating these, the network gains a more comprehensive representation, improving its ability to understand complex relationships.
As shown in Figure 7, in VFNet the observational features from the current and two previous time steps are fed into a multi-head self-attention unit with four heads, enabling the network to capture changes in the follower UAV's flight state and the tracked target over a short temporal window, thereby enhancing its understanding of the flight trend. These features are transformed into query, key, and value vectors via the weight matrices W_Q, W_K, and W_V, and then organized into query (Q), key (K), and value (V) matrices. The matrix product of the query and key matrices is computed to derive attention scores, which are then scaled and normalized through a SoftMax function to produce attention weights, as shown in Equation (7), where d_k is the dimension of the key vector. The attention weights measure the importance of information for future flight and are used to perform a weighted sum of the value matrix, resulting in the attention output.
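Equation (7), as described, is the standard scaled dot-product attention, which can be written as:

```latex
% Standard scaled dot-product attention, matching the description of Equation (7).
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```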
In addition, to strengthen the network's focus on the present flight, we concatenate the current observational features with the flattened output of the attention unit. The result then passes through a linear layer with a ReLU activation function, producing the final output: the normalized offset of the next waypoint relative to the current position of the quadrotor in the UAV body frame, as outlined in Equations (8) and (9). Here, the three output components respectively represent the normalized waypoint offset values along the three axes of the body frame, and a normalized magnitude of the waypoint offset vector is defined from these components, while R_max stands for the predetermined maximum reference value for the magnitude of the waypoint offset. To obtain the waypoint offset in the navigation frame, we perform the waypoint transformation depicted in Equation (10), where R represents the magnitude of the predicted waypoint offset. Ultimately, the waypoint for the next moment is the quadrotor's current position plus this navigation-frame offset.
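The Python sketch below gives one consistent reading of Equations (8)–(10): the normalized body-frame offset is scaled by the reference magnitude, rotated into the navigation frame, and added to the current position. The function and argument names are illustrative, and the exact original equations may differ in form.

```python
import numpy as np

def next_waypoint(offset_norm_body, R_body_to_nav, current_position, r_max=30.0):
    """One reading of Equations (8)-(10): scale the normalised body-frame offset,
    rotate it into the navigation frame, and add it to the current position."""
    offset_norm_body = np.asarray(offset_norm_body, dtype=float)   # normalised offsets
    norm = np.linalg.norm(offset_norm_body)
    if norm == 0.0:
        return np.asarray(current_position, dtype=float)           # no movement predicted
    R = norm * r_max                                               # offset magnitude in metres
    offset_nav = R_body_to_nav @ (offset_norm_body / norm * R)     # offset in the navigation frame
    return np.asarray(current_position, dtype=float) + offset_nav
```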
3.4.4. Simulation Flight System and Neural Network Training
As a deep neural network, VFNet develops its predictive capability through training. However, collecting the necessary data in the real world poses significant challenges. Therefore, we construct a simulation flight system based on the open-source physics simulator Gazebo [35], the PX4 flight controller [36], and the Robot Operating System (ROS) [37]. This system simulates the leader-follower formation, comprising a fixed-wing UAV as the target and a quadrotor UAV as the follower, as shown in
Figure 8. Gazebo provides a realistic physical simulation environment, simulating attributes such as gravity and inertia during UAV flight while generating various sensor data. As we do not require the follower to precisely adhere to the leader's trajectory, developing a corresponding kinematic model to accurately control the quadrotor UAV's attitude is not strictly necessary. Instead, we let the PX4 flight controller perceive and fuse the necessary sensor data to compute the real-time flight attitude, as well as manage the fundamental dynamics control of the UAV. ROS serves as a communication bridge, with a control node functioning as a virtual master computer. The flight controllers, along with the onboard camera, are connected to ROS, enabling data transmission to the virtual master computer. Through ROS, the virtual master computer transmits flight commands consisting of waypoints and yaw angles, effectively governing the flight behavior of the quadrotor UAV.
To obtain training data, the simulation flight system operates in two modes: one for offline training and the other for online training. In both modes, the fixed-wing UAV, serving as the leader, flies freely according to waypoints generated by the program or manually input. The quadrotor UAV, acting as the follower, controls its yaw angle based on outputs from the angular geometry calculator. The virtual master computer sends the current aerial position of the fixed-wing UAV to the quadrotor as the next waypoint. This represents an ideal leader-follower formation strategy under conditions of effective communication between the two UAVs. We employ imitation learning to train the neural network to mimic this strategy. During offline training, the virtual master computer records the quadrotor's positional data, attitude angles, waypoint data, and images captured by the onboard camera throughout the flight, creating a dataset. In online training mode, the fixed-wing UAV is programmed to fly towards randomly generated waypoints within a reasonable area, and the data gathered during the flight is directly used for network training.
VFNet first utilizes the dataset for offline supervised training. Subsequently, to mitigate potential biases in the dataset, VFNet will undergo a period of online training. As the fixed-wing UAV remains in stable flight states, such as straight-line flight, for most of the time, with only brief intervals of high-maneuver activity, the training data presents a long-tail distribution, consisting of a large proportion of stable flight data and a small fraction of high-maneuver data. To address this, we employ the L1 loss function to fit stable flight data and the L2 loss to handle high-maneuver “outliers”. Adagrad is selected as the optimizer for the entire training process, and the loss curve is shown in
Figure 9.
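The text above does not spell out how the two losses are combined; one plausible reading, sketched below, applies the L1 loss to stable-flight samples and the L2 loss to samples flagged as high-maneuver. The maneuver mask is a hypothetical input, not part of the original formulation.

```python
import torch
import torch.nn.functional as F

def waypoint_loss(pred, target, maneuver_mask):
    """One plausible combination of the two losses mentioned in the text:
    L1 on stable-flight samples, L2 (MSE) on high-maneuver samples.
    `maneuver_mask` is a hypothetical boolean tensor marking high-maneuver data."""
    l1 = F.l1_loss(pred, target, reduction="none").sum(dim=-1)
    l2 = F.mse_loss(pred, target, reduction="none").sum(dim=-1)
    return torch.where(maneuver_mask, l2, l1).mean()
```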
4. Experiments
To evaluate the effectiveness of our proposed method, we conducted a series of experiments using the simulation flight system. Furthermore, through an ablation study, we have validated that learning semantic features from clipped images significantly improves waypoint prediction.
4.1. Experimental Setup
The simulation flight system is developed using ROS Noetic and Gazebo 11, coupled with PX4 autopilot version 1.13.0. In our experiments, both types of UAVs are configured with a maximum flight speed (V_max) of 20 m/s, while the fixed-wing UAV has a minimum speed of 10 m/s. The quadrotor UAV's onboard camera has a resolution of 1920 × 1080, with an 80° horizontal and 45° vertical field of view. The maximum reference value for the magnitude of the waypoint offset (R_max) is set to 30 m. Throughout the experiments, the ROS rate of the simulation flight system is 5 Hz.
The experimental platform utilizes a computer running the Ubuntu 20.04 operating system and is equipped with a 12-core AMD Ryzen 9 7900X CPU, 64 GB of memory, and an Nvidia RTX 4070 GPU, ensuring adequate computing resources.
4.2. Waypoint Prediction Experiment
We select a data segment that includes various maneuvers by the fixed-wing UAV, such as clockwise and counterclockwise spirals, along with changes in flight altitude, to serve as the test set. We use VFNet to predict waypoints and compare the results with the ideal waypoints under communication conditions. The deviation between the two is then calculated in the navigation coordinate system, as illustrated in
Figure 10.
From the heading perspective, VFNet predictions exhibit minimal deviations from the communication approach during stable maneuvers conducted by the leader, such as continuous clockwise or counterclockwise spirals, with an absolute deviation of less than 0.8 m. However, the deviation increases significantly during sharp maneuvers, particularly during the transition from clockwise to counterclockwise spirals around the 30-s mark. Similarly, when assessing altitude, VFNet predictions show minimal deviations during minor maneuvers by the leader, but this pattern shifts during sharp maneuvers. In this test set, the maximum absolute deviations along the three axes are approximately 1.10 m, 1.66 m, and 2.56 m.
Despite the variations in deviations across different maneuvering states, the overall deviations remain consistently close to zero. Statistical analysis indicates that the average deviations along the three axes are 0.0535 m, 0.0640 m, and 0.0143 m. These findings suggest that the waypoint prediction results of VFNet closely align with those of the communication method, demonstrating its effectiveness in learning the formation flying strategy from the communication approach.
4.3. Multi-Scene Chasing Test
To comprehensively assess the performance of our proposed chasing flight method, a series of multi-scene chasing tests are conducted in a simulation flight system. In addition to scenarios involving common maneuvers performed by the fixed-wing UAV, such as straight flight, clockwise, and counterclockwise spiral flights, we specifically examine its dynamic performance by testing two typical scenarios that require sharp maneuverability: a transition from straight flight to spiral flight, and rapid changes in the direction of spiral flight. All test flights are conducted with variations in flight altitude. Moreover, through long-term chasing flight experiments that encompassed various flight scenarios, we validate the stability of our method.
In this section, the coordinates of the UAV flight trajectory are represented in the navigation frame, with the UAV takeoff point serving as the origin.
4.3.1. Straight Flight
Straight flight is one of the fundamental modes of UAV flight. In this experiment, we direct the fixed-wing UAV to cover approximately 350 m in the northeast direction within a duration of about 30 s, employing our method to control the quadrotor UAV to chase it. As depicted in
Figure 11, the simulation flight system records the flight trajectories of both UAVs. On the far right of the figure, a three-dimensional visualization illustrates the evolution of the UAV trajectories during flight, with the trajectory color gradually deepening as the UAVs progress. The results indicate that our method effectively accomplishes the chasing task in this scenario. Throughout this flight segment, the maximum and minimum differences in flight altitude between the fixed-wing UAV and the quadrotor UAV are 0.959 m and 0.630 m, respectively, demonstrating the ability to maintain a certain safety distance.
4.3.2. Spiral Flight
Spiral flight is a prevalent mode for fixed-wing UAVs engaged in tasks such as aerial photography and area monitoring. Unlike straight flight, the constant changes in the velocity direction of the fixed-wing UAV during spiral flight impose higher demands on the tracking capabilities of the quadrotor UAV. In our experiments, we direct the fixed-wing UAV to perform both clockwise and counterclockwise spiral flights, depicting their respective trajectories in
Figure 12 and
Figure 13. The trajectories demonstrate that the quadrotor UAV, under the control of our method, promptly perceives variations in the target's direction, enabling real-time chasing. In the clockwise spiral flight test, the maximum and minimum differences in flight altitude between the fixed-wing UAV and the quadrotor UAV are 1.261 m and 0.910 m, respectively. In the counterclockwise spiral flight test, the maximum and minimum differences in flight altitude between the two UAVs are 1.057 m and 0.626 m, respectively, maintaining a specific safety distance.
4.3.3. Sharp Maneuver Flight
In certain tasks, UAVs are required to execute intense flight maneuvers, enabling swift changes in both flight direction and altitude. In our experiments, we direct the quadrotor UAV to chase a fixed-wing UAV that swiftly transitions from straight flight to counterclockwise spiral flight. The trajectories of the two UAVs in the simulation flight system are illustrated in
Figure 14. The results substantiate that the quadrotor, controlled by our method, is capable of promptly responding to changes in the leader’s flight state. Throughout the tests, the maximum and minimum differences in flight altitude between the two are 1.722 m and 0.546 m, respectively.
Additionally, we increase the maneuver intensity by conducting chasing flights in a scenario where the fixed-wing UAV, engaged in spiral flight, rapidly transitions from clockwise to counterclockwise. The flight trajectories generated by the simulation system are depicted in
Figure 15, which also demonstrates the excellent dynamic performance of our method in effectively chasing the leader. However, we also observe that, due to a time-step delay in the chasing flight, the safety distance between the two UAVs decreases significantly when the leader UAV rapidly descends during altitude maneuvers; the flight altitudes of the leader and follower even momentarily coincide.
4.3.4. Long-Term Formation Flight
Achieving long-term formation flights is a crucial aspect when evaluating the performance of the method. We conduct an experiment with a flight duration exceeding 30 min. Throughout the entire flight, the fixed-wing UAV autonomously navigates towards waypoints randomly generated by the program, executing diverse maneuvers such as straight flight and spiral flight. The quadrotor UAV is entirely controlled by our method throughout the process. Using the open-source ground station software QGroundControl (QGC) v3.4.4, we illustrate the flight trajectories of both UAVs during the experiment, as shown in
Figure 16. The experimental results demonstrate that the follower can consistently track the leader and maintain formation flight, confirming the long-term stability of our method.
4.4. Ablation Study
In VFNet, we have incorporated the self-attention mechanism and utilized the accumulated information to predict waypoints. Furthermore, our method employs ResNet to extract and utilize semantic features in aerial target images. To evaluate their impact on the performance of VFNet, we perform an ablation study by removing the corresponding component, using the same data as mentioned in the waypoint prediction experiment.
As shown in
Table 1, "w/o" indicates the removal of the corresponding component from VFNet, while "(current)" signifies the exclusion of accumulated information. We calculate the mean deviation and standard deviation between the predictions of the neural network and the outputs of the communication method for each scenario. Values closer to zero for both the mean and standard deviation indicate that the waypoint predictions of the neural network align more closely with the ideal results from the communication method, demonstrating a more effective learning outcome for the formation strategy.
The results reveal that the original method's predictions most consistently approximate the outputs of the communication method, and that removing any component hinders VFNet's ability to learn an effective waypoint strategy. Furthermore, the experimental findings show that the absence of image semantic features or of the self-attention mechanism has a more detrimental effect on VFNet's performance than the absence of accumulated information.
We specifically analyze the scenario in which the removal of ResNet leads to the absence of semantic features from the images. In this case, the neural network loses its ability to perceive the flight state of the fixed-wing UAV, relying solely on the observational perspective within the quadrotor UAV's camera view, as provided by the bounding box data. This limitation not only directly reduces the network's waypoint prediction capability but also introduces "label noise", leading to confusion in the predictions. This occurs because, under the same observational perspective, the fixed-wing UAV may exhibit different flight states, each corresponding to a different expected waypoint; as a result, the network cannot reliably determine the exact waypoint needed. The issue becomes particularly evident when the fixed-wing UAV undergoes sharp maneuvers, which cause substantial changes in the quadrotor UAV's observational perspective. As shown in
Figure 17, the waypoint predictions under this condition reveal substantial deviations from the outputs of the communication method, particularly during high-maneuvering flights around the 30-s, 50-s, and 65-s marks. The maximum deviations along the three axes reach 21.33 m, 16.58 m, and 1.31 m. These results underscore a considerable performance gap compared to the original VFNet, as illustrated in
Figure 10.
To investigate the impact of introducing the multi-head self-attention mechanism on improving the waypoint prediction capability of VFNet, we conduct a replacement study. As shown in
Table 2, we replace the multi-head self-attention mechanism with multilayer perceptrons (MLPs) and a gated recurrent unit (GRU), indicated by "[]". In the MLP replacement, the observational features from three time steps are concatenated and used as the input, while in the GRU replacement, the observational features from each time step are fed sequentially into the GRU, and its hidden states are concatenated as the output. Experimental results show that VFNet with multi-head self-attention significantly outperforms the versions with MLP and GRU replacements, highlighting the superiority of multi-head self-attention in waypoint prediction.
4.5. Discussion
The results of our experiments indicate that VFNet effectively predicts waypoints, enabling the quadrotor UAV to pursue the fixed-wing UAV across various maneuvering scenarios. However, it is essential to note that our simulation tests were conducted under ideal, obstacle-free conditions, which are suitable for the simple target-detection algorithm discussed in
Section 3.2. This setup contrasts with real-world environments, where obstacles such as trees and buildings can obscure visibility and backgrounds may be influenced by factors such as weather conditions. In such scenarios, the simplified target-detection algorithm may struggle to accurately identify both aerial targets and obstacles, potentially leading to distorted clipped images and unreliable bounding box data, which can result in abnormal predictions from the neural network. Nevertheless, numerous advanced target-detection algorithms have been developed to effectively identify targets against complex backgrounds and amid obstacles. These algorithms can produce four-dimensional bounding box vectors (x, y, w, h) that are compatible with our method, suggesting that their integration could significantly enhance the robustness of our system. Furthermore, since our method utilizes only clipped images of aerial targets, it remains functional as long as obstructions do not intrude upon that small region, which further supports its applicability in complex environments.
5. Conclusions
In this paper, we presented a learning-based UAV chasing control method that enables a quadrotor UAV to chase a fixed-wing UAV, establishing a leader-follower formation. The method constructs an end-to-end neural network called VFNet using a multi-head self-attention mechanism. By integrating images from the onboard monocular camera with UAV flight state data, VFNet accurately predicts the flight waypoints. The quadrotor's yaw angle is controlled by calculating the LOS angle between the camera and the leader, ensuring that the leader remains within the field of view during flight. Using Gazebo, ROS, and PX4, we constructed a simulation flight system for neural network training and method validation. Results from flight experiments with the fixed-wing UAV performing various maneuvers, along with long-term flight trials, demonstrated that our approach effectively and stably controls the chasing flight. Additionally, ablation studies revealed the positive impact of extracting and utilizing semantic features from aerial target images on waypoint prediction. In future work, we aim to incorporate reinforcement-learning algorithms, allowing UAVs to autonomously explore chasing flight strategies rather than solely learning from communication-based methods. Furthermore, we aim to integrate obstacle detection and advanced path-planning algorithms to adjust the predicted waypoints, enhancing the quadrotor UAV's navigation safety in complex environments.