1. Introduction
Unmanned aerial vehicles (UAVs) have demonstrated substantial potential and hold promising applications in fields such as surveillance, search and rescue, agriculture, and national defense [1,2,3], attracting considerable attention from both academia and industry. UAV formations, in particular, have become a research hotspot. Current approaches to UAV formation mostly rely on cooperative frameworks, in which information exchange between UAVs facilitates coordinated flight and collaborative tasks [4,5,6]. However, in complex environments with strong interference, collaborative means such as radio communication are susceptible to disruption, making cooperation unreliable. Furthermore, as UAVs have become more accessible in recent years, the probability of their misuse has grown, heightening security concerns over malicious flights [7,8,9]. At times it is necessary to track or apprehend intruding UAVs under non-cooperative conditions [10,11]. These non-cooperative scenarios require UAVs to autonomously perceive and analyze information from the environment and generate formation flight control strategies without external assistance. As a typical formation pattern, the leader-follower formation is the focus of our research.
Research on leader-follower formation control for UAVs in cooperative environments has made significant strides. Given that the flight state of the leader UAV is known, many studies have focused on the theoretical aspects of mathematical modeling for the system. A variety of control methods have been proposed, including PID control [12], backstepping control [13], sliding mode control [14], and hybrid control that integrates neural networks [15], all aimed at achieving stable formation control in both disturbed and undisturbed conditions. Additionally, a number of studies have exploited the collaborative relationships among UAVs, employing techniques such as advanced radio [16] and special markers [17] to gather extra information for formation control. However, these methods become impractical in non-cooperative scenarios, as they rely on conditions that are not met in such contexts.
Visual perception is a crucial means for UAVs to acquire information. By capturing images of targets with the onboard camera, UAVs can autonomously navigate and accomplish specific tasks. For stationary and moving targets on the ground, a considerable body of work exists; for example, [18,19,20,21] have extensively explored UAVs tracking ground targets and landing on moving targets on land or water. However, research on aerial targets remains limited due to the increased complexity of their motion: aerial targets have more degrees of freedom than ground targets, which makes tracking more difficult. Existing research predominantly focuses on quadrotors as aerial targets, whose maneuverability is relatively constrained, with only limited exploration of other aircraft types. Additionally, existing methods typically treat visual perception and UAV flight control as distinct processes, allowing researchers to separately take advantage of advancements in both visual target detection and UAV control. For instance, Feng Lin et al. developed a vision-based leader-follower UAV formation flying system [22]. Their system used a known camera model and geometric information about the leader to calculate the relative distance and direction of the leader with respect to the follower. Under a quasi-steady-state assumption, it computed the velocity and acceleration of the leader, guiding the follower UAV in flight. Xuancen Liu et al. employed the Kernelized Correlation Filters (KCF) algorithm to detect and track the target in real time [23]. They equipped a rotary-wing UAV with a three-axis gimbaled camera precisely directed at the target, thereby accomplishing follow-flight through a proportional-guidance tracking strategy. In particular, with the rapid advancement of target detection based on deep neural networks (DNNs), aerial targets can now be identified from images more effectively. Donghee Lee et al. developed a visual tracker comprising an adaptive search region (SR) and a fully convolutional neural network (FCNN) to obtain the precise location of a target UAV in images [24], and successfully controlled a micro UAV to follow a target UAV using this tracker. Kyubin Kim et al. also proposed a UAV tracking system consisting of a UAV tracker and a control signal generator [25]; the tracker used the YOLOv3 object detection network for target identification. Ye Zheng et al. [26] introduced the Det-Fly dataset and evaluated the performance of eight representative deep-learning algorithms for air-to-air drone detection. Building on this, Jianan Li et al. [27] proposed a new pseudo-linear Kalman filter and a novel 3-D helical guidance law to enable a quadrotor UAV to pursue another. However, in these two-stage methods, the visual perception and UAV control algorithms operate in isolation, and the neural networks play no part in controlling the UAV's flight. As a result, the UAV control algorithms receive only basic image information from the outputs of the visual perception algorithms, such as the target's position and size within the image, missing the opportunity to exploit the semantic information extracted by the neural networks. Fabian Schilling et al. [28] proposed an entirely visual approach for coordinating markerless UAV swarms based on the DAgger imitation-learning algorithm. Their method is one-stage but requires UAVs to be equipped with up to six cameras to provide omnidirectional vision, which makes it difficult to apply to most existing UAVs.
In this context, we employ a common monocular camera mounted on a quadrotor UAV as the image acquisition sensor and propose a learning-based visual chasing flight control method for UAVs in non-cooperative scenarios, eliminating the need for radio communication or other external assistance. Taking a highly maneuverable fixed-wing UAV as the aerial target, our approach allows the quadrotor UAV to follow its flight, establishing a leader-follower formation. The main contributions of our work are outlined below:
We develop a novel end-to-end deep neural network model called Vision Follow Net (VFNet) that integrates multi-source data, including images captured by the onboard monocular camera and the flight state of the quadrotor UAV. By employing a multi-head self-attention mechanism, VFNet aggregates information over a temporal window to predict the waypoints needed for the quadrotor’s chasing flight. Additionally, by calculating the line-of-sight (LOS) angle to the target, our method controls the yaw angle of the quadrotor, ensuring that the target remains within the field of view of the onboard camera during formation flying.
We implement a simulation flight system with a fixed-wing UAV and a quadrotor UAV, forming a leader-follower formation. By conducting flight simulations, we validate the effectiveness of the proposed method. The experimental results demonstrate that our method allows UAVs to execute sharp maneuvers and achieve stable chasing flights over long periods and distances. Additionally, ablation studies show that the neural network, through the extraction and learning of deep features from target images, markedly improves the performance of waypoint predictions.
3. Proposed Method
In this section, we introduce our learning-based UAV chasing control method, including target detection, the angular geometry calculator, and the VFNet.
3.1. Workflow
Considering the underactuated nature of the quadrotor UAV [23], we only need to directly track the desired yaw angles and waypoints to control its flight; the flight controller then determines the corresponding roll and pitch angles from these known quantities. Our method uses images captured by the onboard camera and the quadrotor UAV's flight state data to derive these control inputs.
Figure 3 illustrates the entire workflow of our proposed method. At any given moment, the monocular camera of the quadrotor UAV captures an image containing the aerial target, and the flight controller measures the velocity and the three attitude angles (pitch, yaw, and roll) in the navigation coordinate system, reflecting the current flight state. The image is processed by a target-detection algorithm, which provides the position and size of the target in the image coordinate system. Because this information defines a bounding box around the target, it is referred to as bounding box data. To extract semantic features, a region containing the target is cropped from the original image, referred to as the clipped image; it has a fixed side length of 224 pixels. From the clipped image, together with the bounding box and flight state data, the VFNet infers the waypoint of the quadrotor at the next moment. Simultaneously, the angular geometry calculator uses the bounding box data and flight attitude angles to compute the desired yaw angle for the UAV. Finally, the waypoint and the desired yaw angle are transmitted to the flight controller, which employs a cascaded control architecture combining Proportional (P) and Proportional-Integral-Derivative (PID) controllers, ultimately governing the flight of the quadrotor UAV [29]. A detailed explanation is provided in the following paragraphs.
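To make the workflow concrete, the following minimal Python sketch outlines one control cycle. The object and function names (detector, vfnet, angle_calc, controller, clip_region) are illustrative placeholders, not the exact interfaces of our implementation.

```python
import numpy as np

def clip_region(image, bbox, side=224):
    """Crop a fixed-size square around the target centre (clamped to the frame)."""
    x, y, _, _ = bbox
    h, w = image.shape[:2]
    x0 = int(np.clip(x - side // 2, 0, w - side))
    y0 = int(np.clip(y - side // 2, 0, h - side))
    return image[y0:y0 + side, x0:x0 + side]

def control_step(image, flight_state, detector, vfnet, angle_calc, controller):
    """One control cycle of the workflow in Figure 3 (names are illustrative)."""
    bbox = detector(image)                                  # (x, y, w, h) in pixels
    clipped = clip_region(image, bbox)                      # 224 x 224 patch around the target
    waypoint = vfnet.predict(clipped, bbox, flight_state)   # next waypoint from VFNet
    yaw_sp = angle_calc.desired_yaw(bbox, flight_state)     # desired yaw from the geometry calculator
    controller.send_setpoint(waypoint, yaw_sp)              # cascaded P/PID flight controller
```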
3.2. Obtaining Target from the Image
Target detection is a fundamental task in computer vision, and a variety of powerful algorithms can effectively address this task, especially with the help of deep learning. Since detecting UAV targets in images is not the primary focus of our method, we have adopted a common bounding box data format to convey the outcomes of target detection. This format comprises a four-dimensional vector (x, y, w, h), where x and y denote the pixel coordinates of the target's center in the image coordinate system, and w and h represent the pixel width and height of the target, respectively. Any target-detection algorithm that produces output in this format is compatible with our method.
To simplify our work, we implemented an algorithm based on OpenCV to acquire the UAV targets from images. When the fixed-wing UAV is flying in a desolate scene, the target exhibits a noticeable contrast with the background, enabling effective separation through image binarization. As shown in
Figure 4, the process begins with the conversion of the RGB raw image to grayscale. Next, upper and lower thresholds are set for image binarization. Finally, the bounding box data can be obtained by extracting target contours, filtering based on area, and fitting bounding boxes.
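A minimal OpenCV sketch of this pipeline is given below, under the stated assumption of a high-contrast target against a desolate background; the threshold and minimum-area values are illustrative and must be tuned per scene.

```python
import cv2
import numpy as np

def detect_target_bbox(image_bgr, lower=0, upper=90, min_area=20):
    """Binarisation-based detection sketch: returns (x, y, w, h), i.e. the target
    centre and size in pixels, or None if no plausible contour is found.
    Threshold and minimum-area values are placeholders, not tuned constants."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Keep pixels whose intensity lies between the lower and upper thresholds.
    mask = ((gray >= lower) & (gray <= upper)).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Discard tiny contours (noise), then keep the largest remaining candidate.
    candidates = [c for c in contours if cv2.contourArea(c) >= min_area]
    if not candidates:
        return None
    x, y, w, h = cv2.boundingRect(max(candidates, key=cv2.contourArea))
    return (x + w / 2.0, y + h / 2.0, w, h)  # centre x, centre y, width, height
```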
3.3. The Angular Geometry Calculator
After acquiring the target’s pixel coordinates in the image, we use the monocular camera model depicted in
Figure 2 to calculate the LOS angle. By incorporating the known camera mounting angles and the flight attitude of the quadrotor, the relative orientation of the fixed-wing UAV with respect to the quadrotor UAV can be determined. This allows us to obtain the desired yaw angle for the quadrotor. All computations take place within the angular geometry calculator, with the primary steps outlined as follows:
3.3.1. Calculate LOS Angle
Let the coordinates of the fixed-wing UAV target T in the camera frame C be represented as P^C = (X_C, Y_C, Z_C), and let its corresponding position in the image frame I be denoted as (u, v). According to the monocular camera model, we have:
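A standard pinhole-projection form consistent with the definitions below, with the symbol choices (focal length f, pixel sizes d_x and d_y, principal point (u_0, v_0)) assumed rather than taken from the original equation, is:

```latex
% Standard pinhole projection; symbol choices are assumed, not the paper's originals.
Z_C \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = \begin{bmatrix} f/d_x & 0 & u_0 \\ 0 & f/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix}
  = \mathbf{K}\,\mathbf{P}^{C},
\qquad
\mathbf{l}^{C} = \frac{\mathbf{P}^{C}}{\lVert \mathbf{P}^{C} \rVert}.
```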
Here, d_x and d_y denote the physical dimensions of a single pixel on the image sensor in the x and y directions, respectively, and (u_0, v_0) are the pixel coordinates of the principal point, where the lens axis intersects the image plane. The camera's intrinsic matrix K can be acquired using Zhang's camera calibration method [30]. In Equation (1), the unknown depth scalar Z_C does not affect the direction of P^C. Therefore, the LOS vector l^C in the camera frame is the unit vector of P^C.
3.3.2. The Azimuth Between Target and Follower
Let the LOS vector in the camera body coordinate system M be denoted as l^M. According to the coordinate axis definitions, l^M is related to l^C by a fixed rotation between the camera frame and the camera body frame. Given the camera's fixed mounting orientation on the quadrotor UAV, described by three fixed installation angles, we can derive the rotation matrix R_M^B that transforms from the camera body frame M to the UAV body frame B. Similarly, using the aircraft attitude angles (pitch, yaw, and roll), we can obtain the transformation matrix R_B^N of the frame B with respect to the navigation frame N. Therefore, the vector l^N representing the azimuth of the target T relative to the quadrotor UAV is the LOS vector expressed in the navigation frame. To align the quadrotor UAV with the target, the anticipated yaw angle ψ_d is taken as the azimuth of l^N in the horizontal plane.
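These steps can be summarized compactly as follows, under standard rotation conventions; the fixed camera-to-camera-body rotation R_C^M and the atan2-based yaw extraction are assumptions consistent with the description above, not the paper's original equations.

```latex
% Assumed summary of the transform chain and yaw extraction.
\mathbf{l}^{M} = \mathbf{R}_{C}^{M}\,\mathbf{l}^{C}, \qquad
\mathbf{l}^{N} = \mathbf{R}_{B}^{N}\,\mathbf{R}_{M}^{B}\,\mathbf{l}^{M}, \qquad
\psi_{d} = \operatorname{atan2}\!\left(l^{N}_{y},\; l^{N}_{x}\right).
```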
3.4. Vision Follow Net
VFNet is tasked with predicting waypoints for the quadrotor during the chasing flight. It extracts features from three types of information: the flight state data of the quadrotor UAV, the bounding box data, and the clipped images. These features are then integrated to determine the offset of the next moment's waypoint relative to the quadrotor's current position.
3.4.1. Data Preprocessing
Data from different sources exhibit varying numerical features. Directly taking them as inputs for the VFNet may lead to difficulties in model training and performance degradation. Therefore, it is necessary to preprocess the data through transformation.
Bounding box data: We normalize the pixel coordinates of the target center (x) and pixel width (w) by dividing them by the image width. Similarly, the pixel coordinates of the target center (y) and pixel height (h) are normalized by dividing them by the image height.
Clipped images: Each channel of the image undergoes a process of subtracting its mean and dividing by its standard deviation, resulting in normalized image data with a mean of 0 and a standard deviation of 1 for each channel.
Flight state data: These consist of the quadrotor's velocity and attitude angles. The velocity and attitude angles from the flight controller are represented in the navigation frame; we transform them into the UAV body frame and normalize them, as outlined in Equation (5). Here, the flight velocity is rotated into the UAV body frame by the rotation matrix R_N^B, which is computed from the pitch, yaw, and roll angles, and is represented as a unit direction vector. V_max denotes the known maximum flight speed of the quadrotor, and the current flight speed divided by V_max gives the normalized speed.
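The following Python sketch illustrates these three preprocessing steps. It assumes the 1920 × 1080 camera resolution used in our experiments and a navigation-to-body rotation matrix supplied by the caller; the flight-state part follows the textual description of Equation (5) rather than reproducing the original formula.

```python
import numpy as np

def preprocess(bbox, clipped_rgb, v_nav, R_nav_to_body, v_max):
    """Illustrative preprocessing of the three VFNet inputs (Section 3.4.1)."""
    # Bounding box: normalise x, w by the image width and y, h by the image height.
    x, y, w, h = bbox
    img_w, img_h = 1920.0, 1080.0
    bbox_norm = np.array([x / img_w, y / img_h, w / img_w, h / img_h])

    # Clipped image: per-channel zero mean and unit standard deviation.
    img = clipped_rgb.astype(np.float32)
    img = (img - img.mean(axis=(0, 1))) / (img.std(axis=(0, 1)) + 1e-8)

    # Flight state: rotate the velocity into the body frame, then split it into a
    # unit direction vector and a speed normalised by the maximum flight speed.
    v_body = R_nav_to_body @ np.asarray(v_nav, dtype=np.float32)
    speed = np.linalg.norm(v_body)
    v_dir = v_body / speed if speed > 0 else np.zeros(3)
    state_norm = np.concatenate([v_dir, [speed / v_max]])
    return bbox_norm, img, state_norm
```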
3.4.2. Feature Extraction and Aggregation
As illustrated in
Figure 5, the VFNet architecture is primarily divided into two components: the feature extraction and aggregation module, and the self-attention-based waypoint prediction module.
In this module, features are first extracted from the three types of input information mentioned above.
Embedding by linear layers: Flight state data and bounding box data are processed through linear layers with ReLU as the activation function, followed by layer normalization, resulting in 16-dimensional embeddings for each. Through a concatenation operation, these two embeddings are combined into a 32-dimensional vector, which encodes the follower UAV’s “personalized” flight information as well as its observational perspective on the fixed-wing UAV.
Embedding by ResNet: The clipped images are processed through an 18-layer deep residual network (ResNet) [31], producing a 32-dimensional embedding containing deep semantic information. This embedding primarily reflects the "objective" flight state of the fixed-wing UAV as observed by the follower.
Then, the features are aggregated. Inspired by the learning hidden unit contributions (LHUC) algorithm proposed in the field of speech recognition [32] and its application in recommendation systems [33], we introduce it to aggregate the "personalized" and "objective" features. As shown in Figure 6, the follower's "personalized" features are fed into a Gate Neural Unit composed of two layers. The first layer is a linear layer with ReLU activation, containing trainable weights W and biases b, and performs intra-feature interactions to better capture the complex influences among the personalized features. The output of the first layer is processed by trainable weights W' and biases b', then passed through a sigmoid function in the second layer to produce a 32-dimensional gate vector g, whose scale is controlled by a hyperparameter c so that its elements range within [0, c]. We set c to 2, allowing the gate vector to amplify or attenuate the controlled information. Finally, g is combined with the image features using a Hadamard product, an operation that takes two vectors of the same dimensions and produces a new vector in which each element is the product of the corresponding elements of the input vectors. The final output is a personalized representation of the fixed-wing UAV's flight state, referred to as observational features.
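A minimal PyTorch sketch of this gate unit is shown below, assuming 32-dimensional features and a scale of 2 as described above; the module and variable names are illustrative, not taken from our implementation.

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """LHUC-style gate: 'personalized' follower features modulate the image features."""
    def __init__(self, dim=32, scale=2.0):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Linear(dim, dim),   # first layer (W, b): intra-feature interactions
            nn.ReLU(),
            nn.Linear(dim, dim),   # second layer (W', b'), followed by a sigmoid
        )

    def forward(self, personal_feat, image_feat):
        # Gate vector in [0, scale]; scale = 2 lets it amplify or attenuate.
        gate = self.scale * torch.sigmoid(self.net(personal_feat))
        # Hadamard (element-wise) product with the 32-d image embedding.
        return gate * image_feat
```

The output of this unit at each time step serves as the observational features fed to the waypoint prediction module described next.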
3.4.3. Self-Attention-Based Waypoint Prediction
The second module is responsible for predicting waypoints using the observational features. As the flight state of a UAV does not undergo abrupt changes over short durations, the information acquired at the present moment is crucial for predicting the next waypoint, but information accumulated over previous moments can also provide valuable insights. To this end, we apply a multi-head self-attention unit [34]. This mechanism allows the neural network to focus on different parts of the input sequence. Building upon the basic self-attention framework, it introduces multiple attention heads that compute attention weights in parallel, capturing diverse perspectives of the data. Each head highlights different aspects, and by integrating these, the network gains a more comprehensive representation, improving its ability to understand complex relationships.
As shown in Figure 7, in VFNet the observational features from the current and two previous time steps are fed into a multi-head self-attention unit with four heads, enabling the network to capture changes in the follower UAV's flight state and the tracked target over a short temporal window, thereby enhancing its understanding of the flight trend. These features are transformed into query, key, and value vectors via the weight matrices W_Q, W_K, and W_V, and then organized into query (Q), key (K), and value (V) matrices. The matrix product of the query and key matrices is computed to derive attention scores, which are then scaled and normalized through a SoftMax function to produce attention weights, as shown in Equation (7), where d_k is the dimension of the key vector. The attention weights measure the importance of information for future flight and are used to perform a weighted sum of the value matrix, resulting in the attention output.
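Equation (7), as described, is the standard scaled dot-product attention, which can be written as:

```latex
% Standard scaled dot-product attention, matching the description of Equation (7).
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```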
In addition, to strengthen the network's focus on the present flight, we concatenate the current observational features with the flattened output of the attention unit. The result then passes through a linear layer with a ReLU activation function, producing the final output: the normalized offset of the next waypoint relative to the current position of the quadrotor in the UAV body frame, as outlined in Equations (8) and (9). Here, the three output components respectively represent the normalized waypoint offset values along the three axes of the body frame, and a normalized magnitude of the waypoint offset vector is defined from these components, while R_max stands for the predetermined maximum reference value for the magnitude of the waypoint offset. To obtain the waypoint offset in the navigation frame, we perform the waypoint transformation depicted in Equation (10), where R represents the magnitude of the predicted waypoint offset. Ultimately, the waypoint for the next moment is the quadrotor's current position plus this navigation-frame offset.
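The Python sketch below gives one consistent reading of Equations (8)–(10): the normalized body-frame offset is scaled by the reference magnitude, rotated into the navigation frame, and added to the current position. The function and argument names are illustrative, and the exact original equations may differ in form.

```python
import numpy as np

def next_waypoint(offset_norm_body, R_body_to_nav, current_position, r_max=30.0):
    """One reading of Equations (8)-(10): scale the normalised body-frame offset,
    rotate it into the navigation frame, and add it to the current position."""
    offset_norm_body = np.asarray(offset_norm_body, dtype=float)   # normalised offsets
    norm = np.linalg.norm(offset_norm_body)
    if norm == 0.0:
        return np.asarray(current_position, dtype=float)           # no movement predicted
    R = norm * r_max                                               # offset magnitude in metres
    offset_nav = R_body_to_nav @ (offset_norm_body / norm * R)     # offset in the navigation frame
    return np.asarray(current_position, dtype=float) + offset_nav
```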
3.4.4. Simulation Flight System and Neural Network Training
As a deep neural network, VFNet develops its predictive capability through training. However, collecting the necessary data in the real world poses significant challenges. Therefore, we construct a simulation flight system based on the open-source physics simulator Gazebo [35], the PX4 flight controller [36], and the Robot Operating System (ROS) [37]. This system simulates the leader-follower formation, comprising a fixed-wing UAV as the target and a quadrotor UAV as the follower, as shown in
Figure 8. Gazebo provides a realistic physical simulation environment, simulating attributes such as gravity and inertia during UAV flight while generating various sensor data. As we do not require the follower to precisely adhere to the leader's trajectory, developing a corresponding kinematic model to accurately control the quadrotor UAV's attitude is not strictly necessary. Instead, we let the PX4 flight controller perceive and fuse the necessary sensor data to compute the real-time flight attitude, as well as manage the fundamental dynamics control of the UAV. ROS serves as a communication bridge, with a control node functioning as a virtual master computer. The flight controllers, along with the onboard camera, are connected to ROS, enabling data transmission to the virtual master computer. Through ROS, the virtual master computer transmits flight commands consisting of waypoints and yaw angles, effectively governing the flight behavior of the quadrotor UAV.
To obtain training data, the simulation flight system operates in two modes: one for offline training and the other for online training. In both modes, the fixed-wing UAV, serving as the leader, flies freely according to waypoints generated by the program or manually input. The quadrotor UAV, acting as the follower, controls its yaw angle based on outputs from the angular geometry calculator. The virtual master computer sends the current aerial position of the fixed-wing UAV to the quadrotor as the next waypoint. This represents an ideal leader-follower formation strategy under conditions of effective communication between the two UAVs. We employ imitation learning to train the neural network to mimic this strategy. During offline training, the virtual master computer records the quadrotor's positional data, attitude angles, waypoint data, and images captured by the onboard camera throughout the flight, creating a dataset. In online training mode, the fixed-wing UAV is programmed to fly towards randomly generated waypoints within a reasonable area, and the data gathered during the flight is directly used for network training.
VFNet first utilizes the dataset for offline supervised training. Subsequently, to mitigate potential biases in the dataset, VFNet will undergo a period of online training. As the fixed-wing UAV remains in stable flight states, such as straight-line flight, for most of the time, with only brief intervals of high-maneuver activity, the training data presents a long-tail distribution, consisting of a large proportion of stable flight data and a small fraction of high-maneuver data. To address this, we employ the L1 loss function to fit stable flight data and the L2 loss to handle high-maneuver “outliers”. Adagrad is selected as the optimizer for the entire training process, and the loss curve is shown in
Figure 9.
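The text above does not spell out how the two losses are combined; one plausible reading, sketched below, applies the L1 loss to stable-flight samples and the L2 loss to samples flagged as high-maneuver. The maneuver mask is a hypothetical input, not part of the original formulation.

```python
import torch
import torch.nn.functional as F

def waypoint_loss(pred, target, maneuver_mask):
    """One plausible combination of the two losses mentioned in the text:
    L1 on stable-flight samples, L2 (MSE) on high-maneuver samples.
    `maneuver_mask` is a hypothetical boolean tensor marking high-maneuver data."""
    l1 = F.l1_loss(pred, target, reduction="none").sum(dim=-1)
    l2 = F.mse_loss(pred, target, reduction="none").sum(dim=-1)
    return torch.where(maneuver_mask, l2, l1).mean()
```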
4. Experiments
To evaluate the effectiveness of our proposed method, we conducted a series of experiments using the simulation flight system. Furthermore, through an ablation study, we have validated that learning semantic features from clipped images significantly improves waypoint prediction.
4.1. Experimental Setup
The simulation flight system is developed using ROS Noetic and Gazebo 11, coupled with PX4 autopilot version 1.13.0. In our experiments, both types of UAVs are configured with a maximum flight speed (V_max) of 20 m/s, while the fixed-wing UAV has a minimum speed of 10 m/s. The quadrotor UAV's onboard camera has a resolution of 1920 × 1080, with an 80° horizontal and 45° vertical field of view. The maximum reference value for the magnitude of the waypoint offset (R_max) is set to 30 m. Throughout the experiments, the ROS rate of the simulation flight system is 5 Hz.
The experimental platform utilizes a computer running the Ubuntu 20.04 operating system and is equipped with a 12-core AMD Ryzen 9 7900X CPU, 64 GB of memory, and an Nvidia RTX 4070 GPU, ensuring adequate computing resources.
4.2. Waypoint Prediction Experiment
We select a data segment that includes various maneuvers by the fixed-wing UAV, such as clockwise and counterclockwise spirals, along with changes in flight altitude, to serve as the test set. We use VFNet to predict waypoints and compare the results with the ideal waypoints under communication conditions. The deviation between the two is then calculated in the navigation coordinate system, as illustrated in
Figure 10.
From the heading perspective, VFNet predictions exhibit minimal deviations from the communication approach during stable maneuvers conducted by the leader, such as continuous clockwise or counterclockwise spirals, with an absolute deviation of less than 0.8 m. However, the deviation increases significantly during sharp maneuvers, particularly during the transition from clockwise to counterclockwise spirals around the 30-s mark. Similarly, when assessing altitude, VFNet predictions show minimal deviations during minor maneuvers by the leader, but this pattern shifts during sharp maneuvers. In this test set, the maximum absolute deviations along the three axes are approximately 1.10 m, 1.66 m, and 2.56 m.
Despite the variations in deviations across different maneuvering states, the overall deviations remain consistently close to zero. Statistical analysis indicates that the average deviations along the three axes are 0.0535 m, 0.0640 m, and 0.0143 m. These findings suggest that the waypoint prediction results of VFNet closely align with those of the communication method, demonstrating its effectiveness in learning the formation flying strategy from the communication approach.
4.3. Multi-Scene Chasing Test
To comprehensively assess the performance of our proposed chasing flight method, a series of multi-scene chasing tests are conducted in a simulation flight system. In addition to scenarios involving common maneuvers performed by the fixed-wing UAV, such as straight flight, clockwise, and counterclockwise spiral flights, we specifically examine its dynamic performance by testing two typical scenarios that require sharp maneuverability: a transition from straight flight to spiral flight, and rapid changes in the direction of spiral flight. All test flights are conducted with variations in flight altitude. Moreover, through long-term chasing flight experiments that encompassed various flight scenarios, we validate the stability of our method.
In this section, the coordinates of the UAV flight trajectory are represented in the navigation frame, with the UAV takeoff point serving as the origin.
4.3.1. Straight Flight
Straight flight is one of the fundamental modes of UAV flight. In this experiment, we direct the fixed-wing UAV to cover approximately 350 m in the northeast direction within a duration of about 30 s, employing our method to control the quadrotor UAV to chase it. As depicted in
Figure 11, the simulation flight system records the flight trajectories of both UAVs. On the far right of the figure, a three-dimensional visualization illustrates the evolution of the UAV trajectories during flight, with the trajectory color gradually deepening as the UAVs progress. The results indicate that our method effectively accomplishes the chasing task in this scenario. Throughout this flight segment, the maximum and minimum differences in flight altitude between the fixed-wing UAV and the quadrotor UAV are 0.959 m and 0.630 m, respectively, demonstrating the ability to maintain a certain safety distance.
4.3.2. Spiral Flight
Spiral flight is a prevalent mode for fixed-wing UAVs engaged in tasks such as aerial photography and area monitoring. Unlike straight flight, the constant changes in the velocity direction of the fixed-wing UAV during spiral flight impose higher demands on the tracking capabilities of the quadrotor UAV. In our experiments, we direct the fixed-wing UAV to perform both clockwise and counterclockwise spiral flights, depicting their respective trajectories in
Figure 12 and
Figure 13. The trajectories demonstrate that the quadrotor UAV, under the control of our method, promptly perceives variations in the target's direction, enabling real-time chasing. In the clockwise spiral flight test, the maximum and minimum differences in flight altitude between the fixed-wing UAV and the quadrotor UAV are 1.261 m and 0.910 m, respectively. In the counterclockwise spiral flight test, the maximum and minimum differences in flight altitude between the two UAVs are 1.057 m and 0.626 m, respectively, maintaining a specific safety distance.
4.3.3. Sharp Maneuver Flight
In certain tasks, UAVs are required to execute intense flight maneuvers, enabling swift changes in both flight direction and altitude. In our experiments, we direct the quadrotor UAV to chase a fixed-wing UAV that swiftly transitions from straight flight to counterclockwise spiral flight. The trajectories of the two UAVs in the simulation flight system are illustrated in
Figure 14. The results substantiate that the quadrotor, controlled by our method, is capable of promptly responding to changes in the leader’s flight state. Throughout the tests, the maximum and minimum differences in flight altitude between the two are 1.722 m and 0.546 m, respectively.
Additionally, we increase the maneuver intensity by conducting chasing flights in a scenario where the fixed-wing UAV, engaged in spiral flight, rapidly transitions from clockwise to counterclockwise. The flight trajectories generated by the simulation system are depicted in
Figure 15, which also demonstrates the excellent dynamic performance of our method in effectively chasing the leader. However, we also observe that, due to a time-step delay in the chasing flight, the safety distance between the two UAVs decreases significantly when the leader UAV rapidly descends during altitude maneuvers; the flight altitudes of the leader and follower even momentarily coincide.
4.3.4. Long-Term Formation Flight
Achieving long-term formation flights is a crucial aspect when evaluating the performance of the method. We conduct an experiment with a flight duration exceeding 30 min. Throughout the entire flight, the fixed-wing UAV autonomously navigates towards waypoints randomly generated by the program, executing diverse maneuvers such as straight flight and spiral flight. The quadrotor UAV is entirely controlled by our method throughout the process. Using the open-source ground station software QGroundControl (QGC) v3.4.4, we illustrate the flight trajectories of both UAVs during the experiment, as shown in
Figure 16. The experimental results demonstrate that the follower can consistently track the leader and maintain formation flight, confirming the long-term stability of our method.
4.4. Ablation Study
In VFNet, we have incorporated the self-attention mechanism and utilized the accumulated information to predict waypoints. Furthermore, our method employs ResNet to extract and utilize semantic features in aerial target images. To evaluate their impact on the performance of VFNet, we perform an ablation study by removing the corresponding component, using the same data as mentioned in the waypoint prediction experiment.
As shown in
Table 1, "w/o" indicates the removal of the corresponding component from VFNet, while "(current)" signifies the exclusion of accumulated information. We calculate the mean deviation and standard deviation between the predictions of the neural network and the outputs of the communication method for each scenario. Values closer to zero for both the mean and standard deviation indicate that the waypoint predictions of the neural network align more closely with the ideal results from the communication method, demonstrating a more effective learning outcome for the formation strategy.
The results reveal that the original method's predictions most consistently approximate the outputs of the communication method, and that removing any component hinders VFNet's ability to learn an effective waypoint strategy. Furthermore, the experimental findings show that the absence of image semantic features or of the self-attention mechanism has a more detrimental effect on VFNet's performance than the absence of accumulated information.
We specifically analyze the scenario in which the removal of ResNet leads to the absence of semantic features from the images. In this case, the neural network loses its ability to perceive the flight state of the fixed-wing UAV, relying solely on the observational perspective within the quadrotor UAV's camera view, as provided by the bounding box data. This limitation not only directly reduces the network's waypoint prediction capability but also introduces "label noise", leading to confusion in the predictions. This occurs because, under the same observational perspective, the fixed-wing UAV may exhibit different flight states, each corresponding to a different expected waypoint; as a result, the network cannot reliably determine the exact waypoint needed. The issue becomes particularly evident when the fixed-wing UAV undergoes sharp maneuvers, which cause substantial changes in the quadrotor UAV's observational perspective. As shown in
Figure 17, the waypoint predictions under this condition reveal substantial deviations from the outputs of the communication method, particularly during high-maneuvering flights around the 30-s, 50-s, and 65-s marks. The maximum deviations along the three axes reach 21.33 m, 16.58 m, and 1.31 m. These results underscore a considerable performance gap compared to the original VFNet, as illustrated in
Figure 10.
To investigate the impact of introducing the multi-head self-attention mechanism on improving the waypoint prediction capability of VFNet, we conduct a replacement study. As shown in
Table 2, we replace the multi-head self-attention mechanism with multilayer perceptrons (MLPs) and a gated recurrent unit (GRU), indicated by "[]". In the MLP replacement, the observational features from three time steps are concatenated and used as the input, while in the GRU replacement, the observational features from each time step are fed sequentially into the GRU, and its hidden states are concatenated as the output. Experimental results show that VFNet with multi-head self-attention significantly outperforms the versions with MLP and GRU replacements, highlighting the superiority of multi-head self-attention in waypoint prediction.
4.5. Discussion
The results of our experiments indicate that VFNet effectively predicts waypoints, enabling the quadrotor UAV to pursue the fixed-wing UAV across various maneuvering scenarios. However, it is essential to note that our simulation tests were conducted under ideal, obstacle-free conditions, which are suitable for the simple target-detection algorithm discussed in
Section 3.2. This setup contrasts with real-world environments, where obstacles such as trees and buildings can obscure visibility and backgrounds may be influenced by factors such as weather conditions. In such scenarios, the simplified target-detection algorithm may struggle to accurately identify both aerial targets and obstacles, potentially leading to distorted clipped images and unreliable bounding box data, which can result in abnormal predictions from the neural network. Nevertheless, numerous advanced target-detection algorithms have been developed to effectively identify targets against complex backgrounds and amid obstacles. These algorithms can produce four-dimensional bounding box vectors (x, y, w, h) that are compatible with our method, suggesting that their integration could significantly enhance the robustness of our system. Furthermore, since our method utilizes only clipped images of aerial targets, it remains functional as long as obstructions do not intrude upon that small region, which further supports its applicability in complex environments.
5. Conclusions
In this paper, we presented a learning-based UAV chasing control method that enables a quadrotor UAV to chase a fixed-wing UAV, establishing a leader-follower formation. The method constructs an end-to-end neural network called VFNet using a multi-head self-attention mechanism. By integrating images from the onboard monocular camera with UAV flight state data, VFNet accurately predicts the flight waypoints. The quadrotor's yaw angle is controlled by calculating the LOS angle between the camera and the leader, ensuring that the leader remains within the field of view during flight. Using Gazebo, ROS, and PX4, we constructed a simulation flight system for neural network training and method validation. Results from flight experiments with the fixed-wing UAV performing various maneuvers, along with long-term flight trials, demonstrated that our approach effectively and stably controls the chasing flight. Additionally, ablation studies revealed the positive impact of extracting and utilizing semantic features from aerial target images on waypoint prediction. In future work, we aim to incorporate reinforcement-learning algorithms, allowing UAVs to autonomously explore chasing flight strategies rather than solely learning from communication-based methods. Furthermore, we aim to integrate obstacle detection and advanced path-planning algorithms to adjust the predicted waypoints, enhancing the quadrotor UAV's navigation safety in complex environments.