Advanced Cooperative Formation Control in Variable-Sweep Wing UAVs via the MADDPG–VSC Algorithm
Figure 1. Schematic of dynamic relationships in a variable-sweep wing UAV.
Figure 2. Diagram of the variable-sweep wing UAV and mass centers. (a) Structural diagram of the L-30A variable-sweep wing UAV. (b) Schematic of mass centers and position vectors.
Figure 3. Rotating mechanism at the wing–fuselage junction of the L-30A UAV. (a) Schematic of the rotation mechanism. (b) Schematic of the wing–fuselage junction detail.
Figure 4. Aerodynamic characteristics of the L-30A variable-sweep wing UAV. (a) Lift and drag coefficients. (b) Lift-to-drag ratio.
Figure 5. Schematic of cooperative formation control in a variable-sweep wing UAV system.
Figure 6. Structure of the MADDPG–VSC algorithm model.
Figure 7. Schematic representation of the simulation environment. (a) Terrain schematic; (b) simple scenario; (c) complex scenario.
Figure 8. UAV trajectories using MADDPG, MADDPG–VSC, and SAC algorithms. (a) MADDPG algorithm; (b) MADDPG–VSC algorithm; (c) SAC algorithm.
Figure 9. Training reward curves for MADDPG, MADDPG–VSC, and SAC algorithms.
Figure 10. Network parameter variations using MADDPG, MADDPG–VSC, and SAC algorithms. (a) MADDPG algorithm; (b) MADDPG–VSC algorithm; (c) SAC algorithm.
Figure 11. Schematic of the L-30A UAV platform and its components: (a) L-30A UAV system; (b) sensors and safety mechanisms.
Figure 12. Experimental hardware platform and task controller. (a) Hardware platform; (b) task controller.
Figure 13. Scenario map of the formation flight trajectories.
Figure 14. Latency comparison across different hardware platforms.
Figure 15. Energy consumption comparison across hardware platforms.
Figure 16. Fault tolerance and reliability comparison across hardware platforms.
Figure 17. Trajectory tracking error comparison between MADDPG and MADDPG–VSC. (a) Trajectory of MADDPG; (b) trajectory of MADDPG–VSC.
Abstract
1. Introduction
- (1) Optimization of the multi-rigid-body dynamic model for an adaptive morphing wing.
- (2) Development of a cooperative control algorithm for variable-sweep wing UAVs.
- (3) Construction of an adaptive optimization system for multi-objective reinforcement learning.
2. Dynamics Analysis of the Variable-Sweep Wing UAV
2.1. Multi-Rigid-Body Dynamic Model
- (1) Expression of center of mass and velocity
- (2) Multi-rigid-body dynamics equation
- (3) Total force and dynamic equilibrium of the system
- (4) Calculation of additional force
- (5) Simplification of forces and moment
- (6) Dynamic equilibrium of the system
- (7) Relationship between sweep angle and moment
- (8) Linearization of the dynamic equation
2.2. Dynamic Characteristics Analysis of the L-30A UAV
- (1) Connection between the model and aerodynamic analysis
- (2) Detailed analysis of dynamic characteristics
- (3) Aerodynamic characteristics analysis
- (4) Comparative analysis of aerodynamic characteristics
3. Cooperative Control of Variable-Sweep Wing UAV via MADDPG
3.1. Application of MARL in Variable-Sweep Wing UAV
- (1) State-value function
- (2) Action-value function
- (3) Q-learning update
- (4) Policy gradient update (see the formula sketch after this list)
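As a point of reference for the four entries above, the underlying quantities can be sketched in the conventional MADDPG-style formulation; the paper's own notation and subscripts may differ, so treat this only as a reminder of the standard definitions:

```latex
% State-value function of the joint policy \pi (expected discounted return from state s)
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0}=s\right]

% Centralized action-value function of agent i over the joint action (a_1,\dots,a_N)
Q_{i}^{\pi}(s, a_{1}, \dots, a_{N}) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{i,t} \;\middle|\; s_{0}=s,\; a_{0}=(a_{1},\dots,a_{N})\right]

% Temporal-difference (Q-learning style) update of the critic with learning rate \eta
Q(s,a) \leftarrow Q(s,a) + \eta\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]

% Deterministic policy gradient for agent i's actor \mu_{\theta_i}, conditioned on its observation o_i
\nabla_{\theta_{i}} J(\theta_{i}) = \mathbb{E}\left[\nabla_{\theta_{i}} \mu_{\theta_{i}}(o_{i})\, \nabla_{a_{i}} Q_{i}(s, a_{1},\dots,a_{N})\big|_{a_{i}=\mu_{\theta_{i}}(o_{i})}\right]
```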
3.2. Optimization of the MADDPG–VSC Algorithm for Control Design
3.2.1. Reward Function Design
- (1) Sweep angle reward
- (2) Formation reward
- (3) Energy consumption penalty
- (4) Collision avoidance penalty (see the reward sketch after this list)
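The four terms above are typically combined into a single per-step scalar reward. The sketch below uses hypothetical weight names (`W_SWEEP`, `W_FORM`, `W_ENERGY`, `W_COLLISION`) and threshold values chosen purely for illustration; it shows how such a composite reward is usually assembled, not the paper's actual coefficients:

```python
# Hypothetical weights and safety threshold; the paper's coefficients are not reproduced here.
W_SWEEP, W_FORM, W_ENERGY, W_COLLISION = 1.0, 1.0, 0.1, 5.0

def composite_reward(sweep_err_deg: float, formation_err_m: float,
                     energy_used: float, min_separation_m: float,
                     safe_separation_m: float = 10.0) -> float:
    """Combine the four per-step terms of Section 3.2.1 into one scalar reward."""
    r_sweep = -W_SWEEP * abs(sweep_err_deg)        # (1) sweep angle reward: deviation from the commanded sweep
    r_form = -W_FORM * formation_err_m             # (2) formation reward: distance from the desired formation slot
    r_energy = -W_ENERGY * energy_used             # (3) energy consumption penalty
    r_collision = -W_COLLISION if min_separation_m < safe_separation_m else 0.0  # (4) collision avoidance penalty
    return r_sweep + r_form + r_energy + r_collision
```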
3.2.2. Exploration and Learning Efficiency
3.2.3. Convergence and Stability
3.2.4. Adaptive Sweep Angle Control Strategy
3.2.5. Flight States of UAV and Control Strategies
3.3. Structure and Training Process of the MADDPG–VSC Algorithm
3.3.1. Multi-Agent Algorithm Model Structure
3.3.2. Training Process
- (1) Initialization.
- (2) Policy selection.
- (3) Experience storage.
- (4) Parameter updates.
- (5) Target network update.
- (6) Repeat training (see the training-loop sketch after this list).
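Steps (1)–(6) above follow the usual MADDPG training loop. The sketch below assumes generic `env`, `agents`, and `buffer` interfaces (all hypothetical names) and reuses the episode, step, and batch settings listed in Section 3.4.1; it illustrates the control flow only, not the authors' implementation:

```python
def train_maddpg(env, agents, buffer, episodes=3000, max_steps=200,
                 batch_size=32, tau=0.01):
    """Control flow of steps (1)-(6); env, agents, and buffer are hypothetical interfaces."""
    for episode in range(episodes):                       # (6) repeat training
        obs = env.reset()                                 # (1) initialization
        for _ in range(max_steps):
            # (2) policy selection: each actor chooses an action with exploration noise
            actions = [agent.act(o, explore=True) for agent, o in zip(agents, obs)]
            next_obs, rewards, done, _ = env.step(actions)
            buffer.add(obs, actions, rewards, next_obs, done)   # (3) experience storage
            obs = next_obs
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                for agent in agents:
                    agent.update_critic(batch)            # (4) centralized critic update
                    agent.update_actor(batch)             #     decentralized actor update
                    agent.soft_update_targets(tau)        # (5) soft target network update
            if done:
                break
```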
3.4. Simulation Verification and Performance Analysis of the MADDPG–VSC
3.4.1. Simulation Environment and Parameter Setting
3.4.2. Algorithmic Performance Evaluation
3.4.3. Comparative Analysis of Learning Efficiency and Convergence
4. Flight Validation and Performance Analysis
4.1. Experimental Platform Overview
4.1.1. L-30A UAV System
4.1.2. Controller Selection and Algorithm Integration
- (1) Controller selection and rationale
- (2) Hardware performance evaluation and optimization
4.2. Flight Validation and Performance Analysis
4.2.1. Experiment Process Overview
4.2.2. Hardware Platform Ground Testing and Analysis
- (1) Latency test
- (2) Energy consumption test
- (3) Throughput test
- (4) Fault tolerance and reliability
4.2.3. Algorithm Performance and Data Analysis
- (1) Coordination evaluation
  - Trajectory tracking error
  - Formation stability
- (2) Real-time responsiveness evaluation
  - Data transmission efficiency
4.2.4. Key Insights and Implications
4.3. Comparative Analysis of Simulation and Flight Validation
- (1) Trajectory and formation stability
- (2) Hardware performance consistency
- (3) Response to innovation points
- (4) Insights and implications for future work
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Parameters | L-30A Variable-Sweep Wing UAV | L-30 Fixed-Wing UAV |
---|---|---|
Takeoff weight | 75 kg | 70 kg |
Service ceiling | 6800 m | 7000 m |
Velocity (cruising) | 100–120 km/h | 100–120 km/h |
Velocity (maximum) | 180 km/h | 150 km/h |
Endurance | 4.8 h | 5 h |
Parameters | 16° Wing Configuration | 60° Wing Configuration |
---|---|---|
Longitudinal Reference Length (m) | 0.80408 | 1.98564 |
Chord Length (m) | 0.33448 | 0.68220 |
Wingspan (m) | 1.53164 | 0.69072 |
Flight States | Flight Velocity (km/h) | Flight Altitude (m) | Sweep Angle (°) |
---|---|---|---|
Takeoff | 0–60 | 0–500 | 0–10 |
Climb | 60–100 | 500–1000 | 10–20 |
Cruise | 120–140 | 1000–2000 | 15–25 |
Maneuver | 100–140 | 500–1500 | 20–35 |
Dive | 120–160 | 0–1000 | 40–50 |
Vertical Attack | 160–180 | 0–500 | 50–60 |
Landing | 0–60 | 0–500 | 0–10 |
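For illustration only, the flight-state envelopes in the table above can be encoded as a lookup that clamps a commanded sweep angle to the range permitted in the current flight state; the function and dictionary names below are hypothetical, while the numeric ranges are transcribed from the table:

```python
# Sweep-angle envelopes (degrees) per flight state, transcribed from the table above.
SWEEP_ENVELOPE_DEG = {
    "takeoff": (0, 10),
    "climb": (10, 20),
    "cruise": (15, 25),
    "maneuver": (20, 35),
    "dive": (40, 50),
    "vertical_attack": (50, 60),
    "landing": (0, 10),
}

def clamp_sweep(flight_state: str, commanded_sweep_deg: float) -> float:
    """Clamp a commanded sweep angle to the envelope allowed in the current flight state."""
    lo, hi = SWEEP_ENVELOPE_DEG[flight_state]
    return max(lo, min(hi, commanded_sweep_deg))
```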
Parameters | Values |
---|---|
Max Episodes | 3000 |
Maximum Number of Steps T | 200 |
Discount Factor γ | 0.2 |
Critic Learning Rate η1 | 0.002 |
Size of Buffer U | 5000 |
Batch Size | 32 |
Actor Learning Rate η2 | 0.001 |
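The training hyperparameters listed above can be collected into a single configuration object; the field names below are illustrative only, while the values are taken directly from the table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    max_episodes: int = 3000   # Max Episodes
    max_steps: int = 200       # Maximum number of steps T
    gamma: float = 0.2         # Discount factor γ
    critic_lr: float = 0.002   # Critic learning rate η1
    actor_lr: float = 0.001    # Actor learning rate η2
    buffer_size: int = 5000    # Size of replay buffer U
    batch_size: int = 32       # Batch size
```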
Algorithm | Maximum Deviation (m) | Average Deviation (m) | Median Error (m) | Total Deviation (m) |
---|---|---|---|---|
MADDPG | 24.06 | 11.13 | 6.04 | 25.31 |
MADDPG–VSC | 18.17 | 8.84 | 5.84 | 21.50 |
SAC | 19.00 | 9.00 | 7.00 | 22.00 |
Algorithm | Maximum Deviation (m) | Average Deviation (m) | Median Error (m) | Total Deviation (m) |
---|---|---|---|---|
MADDPG | 13.88 | 4.07 | 7.78 | 16.43 |
MADDPG–VSC | 9.75 | 3.12 | 1.69 | 11.83 |
SAC | 10.00 | 4.00 | 3.50 | 12.50 |
Algorithm | Maximum Deviation (m) | Average Deviation (m) | Median Error (m) | Total Deviation (m) |
---|---|---|---|---|
MADDPG | 12.38 | 3.91 | 2.10 | 14.66 |
MADDPG–VSC | 8.79 | 2.55 | 1.57 | 10.55 |
SAC | 9.05 | 3.15 | 2.50 | 12.50 |
Hardware Platforms | Model | Runtime (ms) | Frequency (GHz) |
---|---|---|---|
ARM | ARM-A53 | 0.5550 | 1 |
ARM+FPGA | ZCU104 | 0.3550 | 1 |
Hardware Platforms | Model | Runtime (ms) | Average Energy Consumption (W) |
---|---|---|---|
ARM+FPGA | ZCU104 | 0.78 | 15 |
ARM | ARM-A53 | 15.2 | 5 |
CPU | E5-2630v4 | 0.45 | 85 |
Waypoint | wp1 | wp2 | wp3 | wp4 | wp5 | wp6 | wp7 | wp8 |
---|---|---|---|---|---|---|---|---|
Lon/° E | 90.1437 | 90.1581 | 90.1767 | 90.1823 | 90.1618 | 90.1643 | 90.1520 | 90.1431 |
Lat/° N | 38.3884 | 38.3939 | 38.4016 | 38.3901 | 38.3838 | 38.3763 | 38.3721 | 38.3882 |
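As a side note, the leg lengths between consecutive waypoints in the table above can be estimated with the standard haversine formula; the helper below is a generic geodesy sketch, not part of the paper's method:

```python
import math

def haversine_m(lon1_deg, lat1_deg, lon2_deg, lat2_deg, earth_radius_m=6371000.0):
    """Great-circle distance in metres between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1_deg), math.radians(lat2_deg)
    dphi = math.radians(lat2_deg - lat1_deg)
    dlam = math.radians(lon2_deg - lon1_deg)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

# Example: the leg from wp1 to wp2 in the table above is roughly 1.4 km.
# haversine_m(90.1437, 38.3884, 90.1581, 38.3939)
```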
Hardware Platforms | Model | Throughput (Tasks/s) |
---|---|---|
ARM | ARM-A53 | 100 |
ARM+FPGA | ZCU104 | 150 |
CPU | E5-2630v4 | 200 |
Waypoint | wp1 | wp2 | wp3 | wp4 | wp5 | wp6 | wp7 | wp8 |
---|---|---|---|---|---|---|---|---|
MADDPG | 20.6 m | 25.3 m | 27.8 m | 23.5 m | 26.7 m | 23.3 m | 26.2 m | 23.3 m |
MADDPG–VSC | 14.5 m | 16.9 m | 17.8 m | 14.5 m | 17.2 m | 16.8 m | 16.9 m | 14.4 m |
Task Scenario | Avg Response Time (s), MADDPG–VSC | Avg Response Time (s), MADDPG |
---|---|---|
Formation Adjustment | 1.2 | 1.8 |
Obstacle Avoidance | 1.3 | 2.0 |
Trajectory Recalibration | 1.1 | 1.7 |
Algorithm | Data Transmission Delay (s) | Data Transfer Rate (Mbps) |
---|---|---|
MADDPG–VSC | 0.2 | 1.4 |
MADDPG | 0.3 | 1.3 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).