Article

Deep Reinforcement Learning Algorithm with Long Short-Term Memory Network for Optimizing Unmanned Aerial Vehicle Information Transmission

Yufei He, Ruiqi Hu, Kewei Liang, Yonghong Liu and Zhiyuan Zhou
1 Polytechnic Institute, Zhejiang University, Hangzhou 310015, China
2 Department of Applied Mathematics, Hong Kong Polytechnic University, Hong Kong, China
3 School of Mathematical Sciences, Zhejiang University, Hangzhou 310058, China
4 Applied Mathematics, Beijing Normal University—Hong Kong Baptist University United International College, Zhuhai 519087, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2025, 13(1), 46; https://doi.org/10.3390/math13010046
Submission received: 29 November 2024 / Revised: 20 December 2024 / Accepted: 23 December 2024 / Published: 26 December 2024
(This article belongs to the Special Issue Artificial Intelligence: Large Language Models and Big Data Analysis)

Abstract

The optimization of information transmission in unmanned aerial vehicles (UAVs) is essential for enhancing their operational efficiency across various applications. This issue is framed as a mixed-integer nonconvex optimization challenge, which traditional optimization algorithms and reinforcement learning (RL) methods often struggle to address effectively. In this paper, we propose a novel deep reinforcement learning algorithm that utilizes a hybrid discrete–continuous action space. To address the long-term dependency issues inherent in UAV operations, we incorporate a long short-term memory (LSTM) network. Our approach accounts for the specific flight constraints of fixed-wing UAVs and employs a continuous policy network to facilitate real-time flight path planning. A non-sparse reward function is designed to maximize data collection from internet of things (IoT) devices, thus guiding the UAV to optimize its operational efficiency. Experimental results demonstrate that the proposed algorithm yields near-optimal flight paths and significantly improves data collection capabilities, compared to conventional heuristic methods, achieving an improvement of up to 10.76%. Validation through simulations confirms the effectiveness and practicality of the proposed approach in real-world scenarios.

1. Introduction

In recent years, the scope of applications for unmanned aerial vehicles (UAVs) has significantly expanded, which is driven by advancements in communication and internet of things (IoT) technologies [1,2]. UAVs present distinct advantages in wireless communication compared to traditional network equipment [3]. They can be deployed in various environments for data collection tasks, are unaffected by terrain, and provide a broader field of view from high altitudes [4]. This capability enhances regional coverage, communication network reliability, and data transmission speed [5]. For instance, in smart cities, UAVs are increasingly used for efficient data collection from ground-based IoT devices, contributing to improved urban monitoring and management systems. However, UAVs encounter challenges, including limited energy during flight, small size, rapid movement, and dynamic network conditions, which lead to unstable signal connections with ground-based IoT devices [6].
To enhance the efficiency of UAV information transmission under various constraints, researchers have increasingly adopted deep reinforcement learning (DRL), which combines the capabilities of deep learning (DL) and reinforcement learning (RL). DRL has proven effective in both continuous and discrete spaces, providing solutions for perception and decision-making in complex UAV environments [7,8]. Silver et al. [9] introduced the Deterministic Policy Gradient (DPG) algorithm to effectively address continuous action problems; compared with stochastic policies, it simplifies training, particularly in high-dimensional action spaces. Mnih et al. [10] developed the Deep Q-Network (DQN) algorithm, which integrates Q-learning and DL to address the dimensionality explosion associated with high-dimensional inputs in discrete action and state spaces; however, it cannot effectively handle continuous state and action spaces. Schulman et al. [11] introduced the Proximal Policy Optimization (PPO) algorithm based on policy gradients, which clips the importance sampling ratio. However, PPO trains only on the most recent policy interactions, neglecting the potential benefits of prior interactions. The Actor–Critic (AC) algorithm [12,13,14] effectively manages continuous action spaces. It consists of an Actor network responsible for generating actions and a Critic network responsible for evaluating the value of those actions; a loss function is employed to iteratively update the network parameters and obtain an optimal action strategy. A notable challenge, however, is the slow convergence of the Critic network, which can adversely affect the performance of the Actor network. To address this issue, Lillicrap et al. [15] proposed the Deep Deterministic Policy Gradient (DDPG) algorithm, which employs a dual network structure whose parameters are updated in a sliding-average manner, promoting more stable learning dynamics.
Despite advancements in RL algorithms, existing methods struggle to address the challenges of path planning and information collection in UAV operations, particularly when dealing with hybrid action spaces that combine discrete and continuous decisions. These limitations necessitate the development of novel approaches tailored to the unique requirements of UAV systems.
In this paper, we introduce a novel approach for optimizing the information transmission of UAVs through the integration of DRL and long short-term memory (LSTM) networks. The proposed algorithm aims to identify near-optimal information collection paths while adhering to various operational constraints. The main contributions of this paper are as follows:
  • We establish a robust simulation environment by integrating a channel and energy consumption model tailored for fixed-wing UAVs, creating a comprehensive mathematical framework for UAV information transmission.
  • Our novel DRL framework addresses the optimal control problem with mixed variables, utilizing two policy networks: a continuous policy for path control and a discrete policy for information collection.
  • We incorporate an LSTM network into the state representation to enhance predictive capabilities and enable effective global optimization through systematic state adjustments.
  • We optimize algorithm parameters through extensive experimentation and develop a dynamic reward function to mitigate reward sparsity, enhancing learning efficiency. Simulation results demonstrate that our algorithm achieves near-optimal data collection paths while meeting speed and energy constraints.
The remaining sections of the paper are organized as follows: Section 2 reviews prior research related to the optimal control of UAV information transmission. Section 3 details the model developed in this study, and Section 4 describes the architecture of the proposed algorithm. Section 5 presents the validation of the algorithm’s performance through simulation experiments and analyzes the obtained results. Finally, Section 6 concludes the paper and discusses future research.

2. Related Work

In the wake of advancements in UAV technology, the optimization problem of UAV-assisted wireless communication has become a topic of active research. Here, we review some relevant studies in this field.
Na et al. [16] proposed the use of Orthogonal Frequency Division Multiplexing (OFDM) in UAV uplink communication systems. This method divides the channel into orthogonal subchannels [17], reducing interference and optimizing user scheduling, channel allocation, and trajectory planning for UAVs. Zeng et al. [18] considered both communication throughput and energy consumption in their study. They developed a differential equation model that relates the energy consumption of fixed-wing UAVs to their speed and acceleration; by designing continuous-time flight trajectories, they achieved reduced energy consumption and increased communication throughput. Lee et al. [19] focused on free-space optical communication systems and considered flight constraints such as data transmission rate, propulsion energy consumption, and the atmospheric environment. By treating the UAV trajectory as a discrete-time model, they optimized the trajectory to meet these flight constraints.
DRL has gained popularity in the field of UAVs and wireless communications due to its powerful learning and control capabilities. Liu et al. [20] proposed DRL-EC3, a DRL-based algorithm that controls a group of UAVs acting as airborne base stations. The algorithm aims to minimize energy consumption while ensuring communication coverage among the UAVs. Yin et al. [21] used the deterministic policy gradient (DPG) RL algorithm to optimize UAV trajectories and improve uplink transmission power, even in scenarios with unknown ground user movement, user transmission power, and channel parameters. Liu et al. [22] applied RL to mobile edge computing networks mounted on UAVs. They developed a DDPG-based algorithm that dynamically plans the flight trajectory based on the positions of mobile terminal users while accounting for the limited UAV energy, ensuring quality of service for the mobile terminal users. Xiong et al. [23] proposed an algorithm model based on Q-learning and DQN to optimize wireless energy and data transmission strategies in a UAV-assisted large-scale machine-type communication data transmission system. Yuan et al. [24] designed a new twin delayed deep deterministic policy gradient (TD3) algorithm based on DRL to improve the wireless surveillance capability, eavesdropping success rate, and average eavesdropping rate of fixed-wing UAVs. Zhu et al. [25] proposed a security information transmission solution based on RL and a location confusion algorithm (RLPC-SIT) to achieve secure data transmission between UAVs. Peng et al. [26] proposed a robust memory-based softmax deep double deterministic policy gradients (MSD3) approach, which enables the UAV to adaptively collect delay-tolerant data from randomly distributed ground devices starting from any arbitrary point. However, these methods face limitations when applied to UAV operations, particularly in addressing the complexities of hybrid action spaces and the demands of continuous-time trajectory optimization, which are critical for effective path planning and information collection.
While many studies have applied DRL models to optimization problems in UAV wireless communication, fewer have focused on continuous-time models of UAV trajectories [27,28]. Additionally, most studies have used Recurrent Neural Networks (RNNs) to process time series data [29,30]. However, traditional RNNs suffer from issues such as error accumulation and increased computational burden when dealing with long sequences. In contrast, LSTM networks are an improved type of RNN that addresses these challenges [31,32,33]. By incorporating LSTM networks into state coding, our study develops a DRL method that uses two policy networks to control UAV paths and information collection, respectively.

3. Model Description

As shown in Figure 1, there are N IoT devices distributed in the region Ω . A fixed-wing UAV equipped with K channels is utilized for data collection from IoT devices within the designated region, operating without the capability to maintain a stationary position mid-flight.
Given the ground origin $[0, 0, 0]$, the position of the UAV at time $t$ is defined as

$$ \mathbf{l}_{uav}(t) = \big[\, l^{x}_{uav}(t),\; l^{y}_{uav}(t),\; h \,\big], \qquad (1) $$

where $\mathbf{l}_{uav}(t) \in \mathbb{R}^3$, $l^{x}_{uav}(t)$ and $l^{y}_{uav}(t)$ are the distances of the UAV from the origin along the x-axis and y-axis, and $h$ is the flying height. Similarly, the position of IoT device $n$ is defined as

$$ \mathbf{z}_n(t) = \big[\, z^{x}_{n}(t),\; z^{y}_{n}(t),\; 0 \,\big], \qquad (2) $$

where $\mathbf{z}_n(t) \in \mathbb{R}^3$ and $n \in \mathcal{N} = \{1, \ldots, N\}$, the set of IoT device indices.

3.1. Channel Modeling

In order to avoid co-channel interference and further improve spectral efficiency, an uplink channel system based on OFDM is considered [34], where each channel can only be occupied by one IoT device at each moment.
The communication link between the UAV and terrestrial IoT devices relies on radio wave transmission, which is susceptible to environmental factors that cause path loss at the transmitter and receiver. The large-scale channel fading between the UAV and the n-th IoT device, accounting for path loss and average signal energy attenuation, is measured by

$$ d_n(t) = \begin{cases} \|\mathbf{z}_n(t) - \mathbf{l}_{uav}(t)\|^{-\alpha}, & \text{LoS link}, \\ \rho\, \|\mathbf{z}_n(t) - \mathbf{l}_{uav}(t)\|^{-\alpha}, & \text{NLoS link}, \end{cases} \qquad (3) $$

where the LoS link denotes line-of-sight connectivity and the NLoS link denotes non-line-of-sight connectivity. The parameter $\alpha \in [2, 4]$ is the path-loss exponent, which varies with the environment: it typically takes higher values in urban areas, where obstructions are more prevalent, and lower values in rural areas with fewer obstacles. The parameter $\rho \in (0, 1)$ captures the additional attenuation attributed to NLoS connections [35].
The connectivity between an IoT device (denoted by $n$) and the UAV depends on the environment's characteristics, such as its type, obstacles, and propagation density, alongside their relative positions and elevation angle. The probability of establishing a LoS connection is determined by

$$ P^{LoS}_{n}(t) = \frac{1}{1 + b_1 \exp\!\big(-b_2\,(\theta_n(t) - b_1)\big)}, \qquad (4) $$

where the constants $b_1$ and $b_2$ characterize the environment, distinguishing rural from urban settings, and

$$ \theta_n(t) = \frac{180}{\pi} \arcsin\!\left(\frac{h}{\|\mathbf{z}_n(t) - \mathbf{l}_{uav}(t)\|}\right) \qquad (5) $$

represents the elevation angle between the n-th IoT device and the UAV [36]. Furthermore,

$$ P^{NLoS}_{n}(t) = 1 - P^{LoS}_{n}(t). \qquad (6) $$
Let us define the channel gain between IoT device $n$ and UAV channel $k$ as

$$ g_{nk}(t) = h_{nk}(t)\, d_n(t), \qquad (7) $$

where $h_{nk}$ represents small-scale fading, i.e., the rapid fluctuations in signal phase and amplitude caused by reflection, scattering, and multipath interference over short distances or time intervals during signal propagation. We assume Rayleigh fading, so that the channel power gains are independently and identically distributed with $|h_{nk}|^2 \sim \mathrm{Exp}(1)$ for all $n \in \mathcal{N}$ and $k \in \mathcal{K}$. Here, $\mathcal{K} = \{1, \ldots, K\}$ denotes the set of UAV channel indices.
The signal-to-noise ratio (SNR) for the link between the n-th IoT device and the k-th UAV channel is given by

$$ r_{nk}(t) = \frac{p_{nk}(t)\, |h_{nk}(t)|^{2}\, d_n(t)}{\omega\, \sigma^2}, \qquad (8) $$

where $p_{nk}(t)$ represents the uplink transmission power of the n-th IoT device on the k-th UAV channel at time $t$, $\omega$ represents the channel spacing, and $\sigma^2$ is the noise power density.
The maximum uplink transmission power, denoted $p_{uplink}$, is limited at the IoT devices by hardware limitations and regulatory constraints. It is typically set as a constant in standard configurations to ensure compliance with communication standards and to minimize interference within the network [37,38]. We therefore express the uplink transmission power of the n-th IoT device on the k-th channel at time $t$ as $p_{nk}(t) = s_{nk}(t)\, p_{uplink}$, where $s_{nk}(t) \in [0, 1]$ indicates the fraction of the maximum uplink transmission power used for data transmission at time $t$.
Correspondingly, the transmission rate of the link between the n-th IoT device and the UAV is

$$ r_n\big(\mathbf{s}_n(t), \mathbf{l}_{uav}(t)\big) = \sum_{k \in \mathcal{K}} \omega \log_2\!\big(1 + r_{nk}(t)\big), \qquad (9) $$

where $\mathbf{s}_n(t) = \{ s_{nk}(t) \}_{k \in \mathcal{K}}$ scales the SNR on each channel. The information collected by the UAV from time $t_1$ to $t_2$ is

$$ \int_{t_1}^{t_2} r_n\big(\mathbf{s}_n(t), \mathbf{l}_{uav}(t)\big)\, dt. \qquad (10) $$
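To make the channel model concrete, the following Python sketch evaluates Equations (3)–(9) for a single device. It is an illustrative sketch rather than the authors' implementation: it assumes the reconstructed forms above, draws the LoS/NLoS state from Equation (4), takes parameter values from Table 1 where available, and uses arbitrary device and UAV coordinates.

```python
import numpy as np

b1, b2 = 10.0, 0.6                   # environment constants (Table 1)
alpha, rho = 2.3, 0.2                # path-loss exponent, extra NLoS attenuation
omega = 312.5e3                      # channel spacing (Hz)
sigma2 = 10 ** (-170 / 10) * 1e-3    # noise power density: -170 dBm/Hz -> W/Hz
p_uplink = 10 ** (20 / 10) * 1e-3    # maximum uplink power: 20 dBm -> W

def elevation_deg(z_n, l_uav):
    """Elevation angle theta_n between device n and the UAV, Equation (5)."""
    return np.degrees(np.arcsin(l_uav[2] / np.linalg.norm(z_n - l_uav)))

def p_los(theta_deg):
    """LoS probability, Equation (4)."""
    return 1.0 / (1.0 + b1 * np.exp(-b2 * (theta_deg - b1)))

def large_scale_gain(z_n, l_uav, rng):
    """Large-scale gain d_n(t), Equation (3), with the LoS state drawn from Equation (4)."""
    dist = np.linalg.norm(z_n - l_uav)
    los = rng.random() < p_los(elevation_deg(z_n, l_uav))
    return dist ** (-alpha) if los else rho * dist ** (-alpha)

def uplink_rate(z_n, l_uav, s_n, rng):
    """Rate r_n over the K channels, Equations (8)-(9), with Rayleigh fading."""
    d_n = large_scale_gain(z_n, l_uav, rng)
    h2 = rng.exponential(1.0, size=s_n.shape)            # |h_nk|^2 ~ Exp(1)
    snr = s_n * p_uplink * h2 * d_n / (omega * sigma2)   # Equation (8)
    return np.sum(omega * np.log2(1.0 + snr))            # Equation (9)

rng = np.random.default_rng(0)
z1 = np.array([300.0, 400.0, 0.0])      # an IoT device on the ground (example)
uav = np.array([350.0, 420.0, 100.0])   # UAV at altitude h = 100
print(uplink_rate(z1, uav, s_n=np.array([1.0, 0.0, 0.0, 0.0]), rng=rng))
```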

3.2. Energy Consumption Model

The energy consumption of the UAV consists primarily of two components: communication energy consumption and flight energy consumption. The UAV is responsible for data acquisition and uplink transmission power control; energy consumption related to data processing is not considered. Assuming that the communication power is a constant $p_C$, the flight power depends on the speed and acceleration of the UAV at time $t$. The power consumed by UAV flight can be expressed as

$$ p_M\big(\mathbf{l}_{uav}(t)\big) = c_1 \|\dot{\mathbf{l}}_{uav}(t)\|^{3} + \frac{c_2}{\|\dot{\mathbf{l}}_{uav}(t)\|} \left( 1 + \frac{\|\ddot{\mathbf{l}}_{uav}(t)\|^{2} - \dfrac{\big(\ddot{\mathbf{l}}_{uav}(t)^{T} \dot{\mathbf{l}}_{uav}(t)\big)^{2}}{\|\dot{\mathbf{l}}_{uav}(t)\|^{2}}}{g^{2}} \right) + m_{uav}\, \ddot{\mathbf{l}}_{uav}(t)^{T} \dot{\mathbf{l}}_{uav}(t), \qquad (11) $$

where $c_1$ and $c_2$ are parameters determined by the UAV's wing area, air density, and related aerodynamic characteristics, $g$ is the gravitational acceleration, and $m_{uav}$ is the UAV mass [18].

The total energy consumed by the UAV during flight from time $t_1$ to $t_2$ is

$$ \int_{t_1}^{t_2} \Big[ p_C + p_M\big(\mathbf{l}_{uav}(t)\big) \Big]\, dt. \qquad (12) $$
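The following short Python sketch evaluates the propulsion power of Equation (11) and approximates the energy integral of Equation (12) by a Riemann sum over a sampled trajectory. It is illustrative only: $c_1$, $c_2$, and $m_{uav}$ follow Table 1, while the communication power $p_C$ and the example trajectory are assumed values.

```python
import numpy as np

c1, c2, m_uav, g0 = 9.26e-4, 2250.0, 10.0, 9.8   # Table 1; g0 is gravity (m/s^2)
p_C = 5.0                                        # assumed constant communication power (W)

def propulsion_power(v, a):
    """p_M for velocity v and acceleration a (3-vectors), Equation (11)."""
    speed = np.linalg.norm(v)
    a_par_sq = np.dot(a, v) ** 2 / speed ** 2    # squared acceleration component along v
    return (c1 * speed ** 3
            + c2 / speed * (1.0 + (np.dot(a, a) - a_par_sq) / g0 ** 2)
            + m_uav * np.dot(a, v))              # kinetic-energy change rate

def flight_energy(velocities, accelerations, dt):
    """Total energy over a sampled trajectory, Equation (12), via a Riemann sum."""
    return sum((p_C + propulsion_power(v, a)) * dt
               for v, a in zip(velocities, accelerations))

# Example: ten seconds of a 20 m/s cruise with a gentle lateral acceleration.
vs = [np.array([20.0, 0.0, 0.0])] * 10
accs = [np.array([0.0, 0.5, 0.0])] * 10
print(flight_energy(vs, accs, dt=1.0))
```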

3.3. Problem Formulation

We assume that the UAV can commence data collection only when flying horizontally at a specific altitude. We then formulate an optimal control model for the mathematical problem of UAV information transmission as follows:
$$ \max_{\mathbf{S},\, \mathbf{l}_{uav},\, k} \;\; \sum_{n \in \mathcal{N}} \eta_n \qquad (13) $$

$$ \text{s.t.} \quad \int_{0}^{T} r_n\big(\mathbf{s}_n(t), \mathbf{l}_{uav}(t)\big)\, dt \;\ge\; \eta_n u, \quad \forall n \in \mathcal{N}, \qquad (14) $$

$$ \sum_{k \in \mathcal{K}} s_{nk} \;\ge\; \eta_n, \quad \eta_n \in \{0, 1\}, \quad \forall n \in \mathcal{N}, \qquad (15) $$

$$ s_{nk}(t) \sum_{n' \in \mathcal{N} \setminus \{n\}} s_{n'k}(t) = 0, \quad \forall n \in \mathcal{N},\; k \in \mathcal{K}, \qquad (16) $$

$$ \frac{d e(t)}{dt} = -\Big[ p_C + p_M\big(\mathbf{l}_{uav}(t)\big) \Big], \qquad (17) $$

$$ e(0) = e_0 > 0, \qquad e(T) - e_{land} \ge 0, \qquad (18) $$

$$ \dot{\mathbf{l}}_{uav}(0) = v_0, \qquad \dot{\mathbf{l}}_{uav}(T) = v_T, \qquad (19) $$

$$ \ddot{\mathbf{l}}_{uav}(0) = a_0, \qquad \ddot{\mathbf{l}}_{uav}(T) = a_T, \qquad (20) $$

$$ |\dot{\mathbf{l}}_{uav}(t)| \le v_{max}, \qquad |\ddot{\mathbf{l}}_{uav}(t)| \le a_{max}. \qquad (21) $$
  • $\mathbf{S} = [s_{nk}]_{n \in \mathcal{N}, k \in \mathcal{K}}$ collects the uplink transmission ratios of IoT device n on channel k. We define T as the total time period during which the UAV operates to complete the data collection task.
  • The parameter $\eta_n \in \{0, 1\}$ indicates whether the data from IoT device n are successfully collected by the UAV: $\eta_n = 0$ means no connection with IoT device n, while $\eta_n = 1$ means a successful connection and data transmission with IoT device n.
  • In Constraint (14), the minimum amount of data the UAV must collect from IoT device n during time T is $\eta_n u$, where u is the minimum required data amount from each IoT device. If the UAV fails to collect the data, it incurs a negative reward, as described by the term $r_i$ in the reward function of Section 3.4.3.
  • Constraint (15) controls the channel allocation in the algorithm. Specifically, for IoT device n, if no communication channel is occupied, then $\sum_{k \in \mathcal{K}} s_{nk} = 0$, which implies $\eta_n = 0$, and the constraint $\sum_{k \in \mathcal{K}} s_{nk} \ge \eta_n = 0$ is satisfied. Conversely, if IoT device n occupies a single channel k for data transmission, then $\sum_{k \in \mathcal{K}} s_{nk} \ge 1$, resulting in $\eta_n = 1$, so the constraint $\sum_{k \in \mathcal{K}} s_{nk} \ge \eta_n = 1$ holds. Therefore, the constraint is valid for all IoT devices (a minimal allocation-check sketch is given after this list).
  • Constraint (16) ensures that each channel can be occupied by at most one IoT device at any time.
  • Constraint (17) models the rate of change of the UAV's energy $e(t)$ over time, as defined in (11) and (12), enabling energy-efficient decision-making.
  • The variable $e_0$ denotes the initial energy available to the UAV, while $e_{land}$ is the energy required by the UAV for landing. Constraints (18)–(21) regulate the velocity and acceleration of the UAV during flight, as detailed in Section 3.4.3 for the reward terms $r_e$, $r_v$, and $r_a$.
  • Maximizing $\sum_{n \in \mathcal{N}} \eta_n$ in the objective function (13) is essential, as it counts the successful connections and data transmissions between the UAV and IoT devices. This directly drives the total amount of information collected, which is the primary goal of the algorithm.
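As a small illustration of the allocation bookkeeping behind Constraints (15) and (16), the sketch below derives the connection indicators $\eta_n$ from a candidate transmission-ratio matrix S and checks that no channel is shared. The matrix values and shapes are illustrative only, not the authors' configuration.

```python
import numpy as np

def allocation_status(S):
    """S has shape (N, K); s_nk is the power ratio of device n on channel k."""
    eta = (S.sum(axis=1) > 0).astype(int)                            # connection indicators
    one_device_per_channel = bool(np.all((S > 0).sum(axis=0) <= 1))  # Constraint (16)
    return eta, one_device_per_channel

S = np.array([[1.0, 0.0, 0.0, 0.0],    # device 1 transmits on channel 1
              [0.0, 1.0, 0.0, 0.0],    # device 2 transmits on channel 2
              [0.0, 0.0, 0.0, 0.0]])   # device 3 is not connected
print(allocation_status(S))            # (array([1, 1, 0]), True)
```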

3.4. Deep Reinforcement Learning Model

Assuming that the optimal control of UAV information transmission conforms to a Markov decision process, the model can be represented by the tuple $\langle S, A, R, P \rangle$, and the basic elements, namely the state, action, and reward of the UAV, are described as follows.

3.4.1. States

We consider an environment where the UAV operates under the assumption of full observability and accessibility. In this setting, the UAV is tasked with both information transmission and path planning. To accomplish these objectives, the UAV must acquire its own state information as well as data from IoT devices in the environment, which are subject to dynamic changes. The state of the UAV at time t is defined as follows:
$$ S_t = \big[\, \mathbf{l}_{uav}(t),\; e_t,\; P_t,\; g_t \,\big], \qquad (22) $$

where the UAV's position $\mathbf{l}_{uav}(t)$ at time $t$ is defined in Equation (1), $e_t$ denotes the remaining energy of the UAV, the vector $P_t = [p_1, p_2, \ldots, p_N]$ contains the probabilities of the UAV being connected to the N IoT devices, and $g_t = \{ g_{nk} \}_{n \in \mathcal{N}, k \in \mathcal{K}}$ encapsulates the channel information for the k-th channel of IoT device n.
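One possible container for this state, with a flattening step for the downstream networks, is sketched below. Field names, shapes, and the flattening choice are illustrative assumptions, not the authors' data layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UAVState:
    l_uav: np.ndarray   # UAV position [x, y, h], Equation (1)
    e_t: float          # remaining energy
    p_t: np.ndarray     # connection probabilities for the N IoT devices
    g_t: np.ndarray     # channel gains, shape (N, K)

    def to_vector(self) -> np.ndarray:
        """Flatten S_t into a single vector for the LSTM encoder and the policies."""
        return np.concatenate([self.l_uav, [self.e_t], self.p_t, self.g_t.ravel()])

s = UAVState(np.array([0.0, 0.0, 100.0]), 1.0, np.zeros(6), np.zeros((6, 4)))
print(s.to_vector().shape)   # (34,) for N = 6 devices and K = 4 channels
```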

3.4.2. Action

The proposed model operates within a scene characterized by a hybrid action space. In this framework, the continuous action space governs the UAV’s path planning, while the discrete action space manages the data transmission between the UAV and IoT devices. Traditional RL algorithms typically address either discrete or continuous action spaces, limiting their applicability in real-world scenarios. To address this limitation, we design a solution model based on the hybrid action space. The action at time t is defined as follows:
$$ A_t = \big[\, l_t,\; \varphi_t,\; n,\; k \,\big], \qquad (23) $$

where $(l_t, \varphi_t)$ controls the UAV's trajectory: $l_t \in [0, L]$ denotes the distance traveled by the UAV, and $\varphi_t \in [0, \pi]$ represents the flight angle relative to the x-axis. The pair $(n, k)$ governs data transmission, indicating that the UAV connects to the k-th channel of IoT device n.

At time $t$, the UAV executes action $A_t$ and moves by a displacement $\Delta \mathbf{l}_{uav} = [\, l_t \cos\varphi_t,\; l_t \sin\varphi_t,\; 0 \,]$, reaching a new position $\mathbf{l}_{uav}(t+1)$ at time $t+1$. This position update is expressed as

$$ \mathbf{l}_{uav}(t+1) = \mathbf{l}_{uav}(t) + \Delta \mathbf{l}_{uav} = \big[\, l^{x}_{uav}(t) + l_t \cos\varphi_t,\; l^{y}_{uav}(t) + l_t \sin\varphi_t,\; h \,\big]. \qquad (24) $$

The resulting transmission reward is $r_n\big(\mathbf{s}_n, \mathbf{l}_{uav}(t+1)\big)$, reflecting the transmission efficiency associated with the n-th IoT device.
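The effect of one hybrid action can be sketched as follows: the continuous part $(l_t, \varphi_t)$ produces the position update of Equation (24), while the discrete part $(n, k)$ selects the device/channel pair. The maximum step length L and the example values below are assumptions for illustration.

```python
import numpy as np

L_MAX = 50.0   # assumed maximum step length L per decision step

def apply_action(l_uav, l_t, phi_t, n, k):
    """Apply A_t = [l_t, phi_t, n, k]: move the UAV (Equation (24)) and pick a link."""
    l_t = float(np.clip(l_t, 0.0, L_MAX))
    new_pos = np.array([l_uav[0] + l_t * np.cos(phi_t),
                        l_uav[1] + l_t * np.sin(phi_t),
                        l_uav[2]])               # altitude h is unchanged
    return new_pos, (n, k)                       # (n, k) drives the channel allocation

pos, link = apply_action(np.array([0.0, 0.0, 100.0]), l_t=30.0, phi_t=np.pi / 4, n=2, k=1)
print(pos, link)
```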

3.4.3. Reward Function

The total reward at time $t$ is expressed as

$$ r_t = r_{des} + r_v + r_a + r_e + r_i, \qquad (25) $$

where the five components are defined as follows.
  • Distance reward $r_{des}$: the term $r_{des} = -\beta\, \|\mathbf{l}_{uav}(t) - \mathbf{z}_T\|$ penalizes the UAV's distance from its target location $\mathbf{z}_T$, encouraging it to reduce this distance and eventually land at the destination. Here, $\beta$ is a positive weighting factor [39]. The target location $\mathbf{z}_T$ is a predefined point representing the UAV's intended flight destination, determined during mission planning.
  • Velocity reward $r_v$: the UAV incurs a penalty of $-10$ if its velocity exceeds the specified maximum speed; otherwise, $r_v = 0$.
  • Acceleration reward $r_a$: similarly, this term assigns a penalty of $-10$ when the UAV's acceleration exceeds the threshold $a_{max}$; if the acceleration remains within the allowable range, $r_a = 0$.
  • Energy control reward $r_e$: to ensure sustainable operation, this term penalizes the UAV with a value of $-100$ when its remaining energy satisfies $e < 0$; otherwise, $r_e = 0$.
  • Information collection reward $r_i$: this component incentivizes the UAV to gather data from IoT devices. It is defined as

$$ r_i = \begin{cases} -10, & r_{information} = 0, \\ \lambda\, r_{information}, & r_{information} \neq 0, \end{cases} \qquad (26) $$

    where $\lambda$ is a positive constant [40] and $r_{information} = \sum_{t=0}^{T} r_n\big(\mathbf{s}_n(t), \mathbf{l}_{uav}(t)\big)$ quantifies the total amount of data collected by the UAV from the IoT devices over the T time steps, with the link transmission rate $r_n(\mathbf{s}_n(t), \mathbf{l}_{uav}(t))$ defined in Equation (9).
This design ensures a comprehensive reward mechanism balancing trajectory optimization, resource management, and data collection efficiency.
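A compact sketch of this composite reward is given below. The penalty magnitudes follow the definitions above and $\beta$, $\lambda$ follow the values used in Figure 7; the speed and acceleration limits v_max and a_max are assumed example values, not the authors' settings.

```python
import numpy as np

beta, lam = 1e-3, 1e-9   # weights as used in Figure 7

def total_reward(l_uav, z_T, speed, accel, energy, info_collected,
                 v_max=30.0, a_max=5.0):
    """Composite reward r_t of Equations (25)-(26)."""
    r_des = -beta * np.linalg.norm(l_uav - z_T)       # distance term
    r_v = -10.0 if speed > v_max else 0.0             # velocity penalty
    r_a = -10.0 if accel > a_max else 0.0             # acceleration penalty
    r_e = -100.0 if energy < 0 else 0.0               # energy-depletion penalty
    r_i = -10.0 if info_collected == 0 else lam * info_collected
    return r_des + r_v + r_a + r_e + r_i

print(total_reward(np.array([500.0, 500.0, 100.0]), np.array([1100.0, 1100.0, 0.0]),
                   speed=25.0, accel=2.0, energy=1e4, info_collected=3e8))
```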

4. Method

Since it is difficult to solve this problem using traditional operations research methods, we propose leveraging RL in a hybrid action space to approximate an optimal solution. This approach enables the UAV to simultaneously perform data collection from IoT devices and select flight actions.
As illustrated in Figure 2, the UAV interacts with its environment and receives observations $o_t$ at each time step, representing the raw information perceived from the environment. These observations are first processed by the UAV to represent the current state and then passed through the LSTM-based state encoding to generate the encoded state $s_t$, which captures long-term dependencies and provides a richer representation of the environment. The encoded state $s_t$ is then shared by both the continuous policy network $\pi_{\theta^c}$ and the discrete policy network $\pi_{\theta^d}$, which output continuous actions $a^c_t$ and discrete actions $a^d_t$, respectively. The Critic network uses the encoded state $s_t$ along with the generated actions to approximate the action–state value function $Q(s, a)$, thereby guiding the optimization of the UAV's path control and information collection.

4.1. LSTM

In RL, the environment evolves dynamically in response to the decisions made by the intelligent agent. Relying solely on the current environment state for decision-making may result in the agent getting trapped in a local optimum. To address this issue, LSTM networks can be employed to enhance decision-making capabilities. LSTM networks are designed with recurrent loops that enable them to retain information over extended time periods. This allows the network’s previous states to influence its future behavior, effectively addressing challenges related to long-term dependencies. Consequently, LSTM networks are particularly suited for processing sequential data.
In this study, we utilize an LSTM network as the state encoding mechanism for the UAV, which is defined as follows:
$$ s_t \leftarrow \mathrm{LSTM}(s_t, h_{t-1}), \qquad (27) $$

where $h_{t-1}$ is the hidden state carrying the environmental information up to time $t-1$. By encoding the UAV's state through the LSTM network, temporal dependencies within the environment can be captured effectively. This enables the decision-making process to incorporate not only the current state but also historical information, which facilitates better navigation of the environment and supports the UAV in making optimal choices during data collection from IoT devices.
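A minimal PyTorch sketch of this encoding step is shown below: an LSTMCell keeps the hidden state $h_{t-1}$ between decision steps and returns the encoded state fed to the policy networks. The hidden size and state dimension are illustrative, not the authors' settings.

```python
import torch
import torch.nn as nn

class LSTMStateEncoder(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.cell = nn.LSTMCell(state_dim, hidden_dim)

    def forward(self, s_t, hc=None):
        # s_t: (batch, state_dim); hc: previous (h_{t-1}, c_{t-1}) or None at t = 0
        h_t, c_t = self.cell(s_t, hc)
        return h_t, (h_t, c_t)   # h_t is the encoded state used by the policies

encoder = LSTMStateEncoder(state_dim=34)
s_t = torch.zeros(1, 34)
encoded, hc = encoder(s_t)        # first step: no history yet
encoded, hc = encoder(s_t, hc)    # later steps reuse the memory, as in Equation (27)
print(encoded.shape)              # torch.Size([1, 128])
```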

4.2. Algorithm Based on Hybrid Action Space

In the proposed model, the UAV is tasked with both flight control and information transmission. Consequently, two distinct policy networks are required: a continuous policy network $\pi_{\theta^c}$, responsible for controlling the UAV's flight trajectory, and a discrete policy network $\pi_{\theta^d}$, responsible for managing the UAV's information collection process.
For the continuous action $a^c = \{ l_t, \varphi_t \}$ generated by the continuous policy network $\pi_{\theta^c}$, the parameters $\theta^{\pi^c}$ are updated using a deterministic policy gradient:

$$ \nabla_{\theta^{\pi^c}} J = \mathbb{E}_{s \sim u} \Big[ \nabla_{\theta^{\pi^c}} \pi_{\theta^c}(s)\; \nabla_{a^c} Q\big(s, a^c, a^d\big) \Big], \qquad (28) $$

where $u$ represents the marginal distribution of $s$.

For the discrete action $a^d = \{ n, k \}$ generated by the discrete policy network $\pi_{\theta^d}$, the parameters $\theta^{\pi^d}$ are updated using a stochastic policy gradient:

$$ \nabla_{\theta^{\pi^d}} J = \mathbb{E}_{s \sim u,\, a^d \sim \pi_{\theta^d}} \Big[ \nabla_{\theta^{\pi^d}} \log \pi_{\theta^d}\big(a^d \mid s\big)\, Q\big(s, a^c, a^d\big) \Big]. \qquad (29) $$

The critic network evaluates both continuous and discrete actions through the state–action value function, expressed as

$$ Q(s, a) = Q\big(s, a^c, a^d\big) = Q\big(s, \pi_{\theta^c}(s), \pi_{\theta^d}(s)\big). \qquad (30) $$
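The two updates can be sketched in PyTorch as follows, assuming a DDPG-style deterministic update for the continuous head (Equation (28)) and a score-function (log-probability) update for the discrete head (Equation (29)). The network sizes, the softmax relaxation fed to the critic, and all hyperparameters are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

enc_dim, cont_dim, n_discrete = 128, 2, 24   # encoded state, (l_t, phi_t), N*K device/channel pairs

actor_c = nn.Sequential(nn.Linear(enc_dim, 64), nn.ReLU(), nn.Linear(64, cont_dim), nn.Tanh())
actor_d = nn.Sequential(nn.Linear(enc_dim, 64), nn.ReLU(), nn.Linear(64, n_discrete))
critic = nn.Sequential(nn.Linear(enc_dim + cont_dim + n_discrete, 64), nn.ReLU(), nn.Linear(64, 1))

opt_c = torch.optim.Adam(actor_c.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(actor_d.parameters(), lr=1e-3)

def update_policies(s):                      # s: encoded states, shape (batch, enc_dim)
    # Continuous head: deterministic policy gradient, Equation (28).
    a_c = actor_c(s)
    a_d_probs = torch.softmax(actor_d(s), dim=-1).detach()   # relaxed discrete action
    q = critic(torch.cat([s, a_c, a_d_probs], dim=-1))
    loss_c = -q.mean()
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()

    # Discrete head: stochastic (log-probability) policy gradient, Equation (29).
    logits = actor_d(s)
    dist = torch.distributions.Categorical(logits=logits)
    a_d = dist.sample()
    one_hot = torch.nn.functional.one_hot(a_d, n_discrete).float()
    with torch.no_grad():
        q_d = critic(torch.cat([s, actor_c(s), one_hot], dim=-1))
    loss_d = -(dist.log_prob(a_d) * q_d.squeeze(-1)).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

update_policies(torch.zeros(8, enc_dim))
```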
The algorithmic framework implementing this approach is presented in Algorithm 1.
Algorithm 1 DRL Algorithm for UAV Messaging
Input: Initialize the continuous policy network π_θc and the discrete policy network π_θd; initialize the critic network Q(s, a^c, a^d); initialize the experience replay buffer R; initialize the IoT device locations.
Output: The planned path l_uav and the total information collected r_information.
 1: for episode = 1, …, M do
 2:   Initialize a random process N for action exploration
 3:   Initialize state s_0
 4:   for t = 1, …, T do
 5:     Get the action a_t = {π_θc(s_{t-1}), π_θd(s_{t-1})} + N_t
 6:     Calculate the reward r_t using Equation (25) and update the state s_t using the LSTM in Equation (27)
 7:     Store the experience tuple (s_{t-1}, a_t, r_t, s_t) in R
 8:     Sample a mini-batch {(s_i, a_i, r_i, s_i')}_{i=1}^{L} from R for gradient descent
 9:     Update the continuous policy network using Equation (28)
10:     Update the discrete policy network using Equation (29)
11:     Update the critic network Q(s, a) using gradient descent
12:   end for
13: end for
14: return the planned path l_uav and the total information collected r_information
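A high-level training skeleton corresponding to Algorithm 1 might look as follows. Here env, encoder, select_action, and update_networks are assumed helper objects in the spirit of the sketches above, and the episode count, horizon, buffer size, and batch size are placeholders.

```python
import random
from collections import deque

replay = deque(maxlen=100_000)      # experience replay buffer R
M, T, BATCH = 500, 200, 64          # episodes, horizon, mini-batch size (placeholders)

def train(env, encoder, select_action, update_networks):
    for episode in range(M):                              # Algorithm 1, line 1
        obs, hc = env.reset(), None
        for t in range(T):                                # line 4
            s, hc = encoder(obs, hc)                      # LSTM encoding, Equation (27)
            a_c, a_d = select_action(s)                   # line 5, with exploration noise
            next_obs, r, done = env.step((a_c, a_d))      # reward r_t from Equation (25)
            replay.append((obs, (a_c, a_d), r, next_obs)) # line 7
            if len(replay) >= BATCH:
                batch = random.sample(replay, BATCH)      # line 8
                update_networks(batch)                    # lines 9-11: Equations (28)-(30)
            obs = next_obs
            if done:
                break
```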

5. Simulation Results and Discussion

To evaluate the performance of the proposed algorithm, we first determine its parameter values experimentally and subsequently compare its effectiveness with that of two heuristic algorithms.
In the simulation experiments, the UAV operates within a predefined flight region $\Omega$, where IoT devices are randomly distributed. The size of the region $\Omega$ is set to $1000 \times 1000$ units. A fixed-wing UAV, equipped with a fixed number of communication channels, collects data from ground-based IoT devices while maintaining a constant altitude of $h = 100$ units. The target location $\mathbf{z}_T$ is set outside the region $\Omega$, so the UAV's goal is to leave the region by flying beyond 1000 units in either the x- or y-direction. The detailed parameter settings used in the simulations are provided in Table 1.
Figure 3 demonstrates the process of UAV path planning and data collection using an example with two IoT devices ($n = 2$), capturing four distinct moments of the UAV's flight trajectory along with its connection status. Figure 3A shows the UAV's starting position. In Figure 3B, as the UAV approaches the first IoT device, it successfully establishes a connection with IoT device 1 and begins data transmission ($\eta_1 = 1$), while no connection is established with IoT device 2 ($\eta_2 = 0$). Figure 3C illustrates the UAV's position between the two IoT devices, where it simultaneously connects with both IoT devices and performs data transmission ($\eta_1 = 1$, $\eta_2 = 1$). Finally, Figure 3D shows the UAV after passing the second IoT device, where it maintains a connection with IoT device 2 for data transmission ($\eta_2 = 1$) but no longer connects with IoT device 1 ($\eta_1 = 0$). This figure illustrates the UAV's dynamic flight process and its corresponding data collection behavior during the mission.
Figure 4 depicts the behavior of the value loss function of the proposed algorithm under three distinct learning rate settings. As shown in Figure 4, a higher learning rate leads to increased oscillations in the loss function and greater randomness in the agent’s actions, hindering convergence to an optimal value. Conversely, a lower learning rate results in slower convergence, requiring more training steps and prolonged training duration. To achieve a balance between training efficiency and performance, a learning rate of 0.001 was selected as the optimal parameter for subsequent experiments.
Figure 5 illustrates the variation of the total reward function during training, comparing the performance of our algorithm with and without the integration of an LSTM network. The reward function enables the UAV to optimize its flight control, including angle and step size, to mitigate penalties from excessive acceleration and energy depletion. Initially, the model without LSTM shows slower convergence and larger fluctuations in the reward, while the model with LSTM demonstrates better learning, resulting in higher and more stable rewards over time. Based on these results, we incorporated the LSTM into the algorithm for subsequent experiments, as it significantly improved the UAV’s ability to adhere to the predefined flight constraints and achieve optimal performance.
Figure 6 investigates the relationship between the number of communication channels and the number of terrestrial IoT devices, using scenarios with three to six IoT devices as examples. As the number of channels increases while the number of IoT devices is kept constant, the amount of data collected by the UAV also increases. However, when the number of channels matches the number of IoT devices, the data collection reaches saturation, and further increasing the number of channels does not significantly enhance data collection. Based on this analysis, we select six ground IoT devices and four UAV communication channels as the parameter values for subsequent experiments.
Since traditional RL methods cannot be applied to this problem, we compare the proposed algorithm with two heuristic algorithms: the greedy algorithm [41] and the simulated annealing algorithm [42]. We conduct five repetitions of each algorithm and average the results.
Figure 7 showcases the UAV flight paths planned by the three algorithms under four representative IoT device location distributions. Our RL algorithm dynamically adjusts the flight distance and steering angle at each step after learning while still adhering to the flight constraints, resulting in flight paths that better align with the overall distribution of the IoT devices. In contrast, the two heuristic algorithms randomly select values for the flight distance and steering angle, leading to less smooth flight paths, and in some cases, the paths fail to align with the overall IoT device distribution.
When the IoT devices are distributed along the diagonal of the region Ω (Figure 7A), the UAV’s flight trajectory also follows a diagonal pattern. Additionally, as the coordinate position and number distribution change, the output path will approximate an “S” shape (Figure 7B). When all IoT devices are concentrated on either the left or right side of the diagonal of the region Ω (Figure 7C,D), our algorithm tends to bias the UAV flight trajectory toward the side where the IoT devices are concentrated until the UAV is controlled to fly out of the restricted region Ω . On the other hand, the heuristic algorithm is more likely to get stuck in a local optimum in this scenario, resulting in suboptimal flight paths.
Based on Figure 8, we analyze the average amount of data collected by the UAV from the ground-based IoT devices. In the four IoT device distribution scenarios, the two heuristic algorithms exhibited only a slight difference in the average amount of data collected. However, the proposed algorithm consistently collected more data than the heuristic algorithms in all four scenarios, with an improvement of up to 10.76%.

6. Conclusions

This paper presents a comprehensive mathematical model for UAV information transmission that integrates channel dynamics and energy consumption. To tackle the optimal control problem in this context, we propose a deep reinforcement learning framework that combines hybrid discrete–continuous action spaces with LSTM networks. This novel approach addresses the limitations of traditional reinforcement learning methods, which are often restricted to single-action-space scenarios.
By leveraging the predictive capabilities of LSTM networks, our method enhances state encoding to incorporate long-term dependencies, enabling more systematic state adjustments and improved global optimization. The proposed framework employs two policy networks: a continuous policy network for UAV trajectory control and a discrete policy network for managing information collection. These actions are evaluated by a Critic network, which uses the UAV's state to approximate the state–action value function effectively.
Simulation experiments demonstrate the efficacy of our algorithm in planning UAV flight paths and optimizing data collection from ground-based IoT devices. The results indicate that our approach achieves near-optimal solutions while adhering to real-world operational constraints, significantly outperforming baseline algorithms.
However, deploying the proposed algorithm in real-world UAVs presents several challenges, particularly due to hardware limitations. UAVs typically have limited computational resources and memory, which restrict the complexity of models that can be processed in real time. Additionally, the finite battery life of UAVs further constrains their ability to perform continuous optimization over extended periods. These factors require careful consideration when applying this algorithm in practical UAVs, which necessitates coordination between the algorithm and hardware conditions to ensure effective performance.
Future research will be focused on addressing several key challenges. The primary objective is to extend the framework to scenarios where UAVs return to base stations for recharging in low-energy states, ensuring task continuity post-recharge and expanded flight coverage. Furthermore, we intend to investigate collaborative approaches for multiple UAVs to work together in executing large-scale information gathering tasks.

Author Contributions

Conceptualization, Y.H., K.L. and Y.L.; methodology, Y.H. and Y.L.; software, Y.H., R.H. and Y.L.; formal analysis, Y.H., R.H. and Z.Z.; writing—original draft preparation, Y.H.; writing—review and editing, Y.H., R.H. and K.L.; supervision, K.L.; project administration, K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 42171412).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zhang, M.; Wang, H.; Wu, J. On UAV source seeking with complex dynamic characteristics and multiple constraints: A cooperative standoff monitoring mode. Aerosp. Sci. Technol. 2022, 121, 107315. [Google Scholar] [CrossRef]
  2. Zhang, X.; Zhao, H.; Wei, J.; Yan, C.; Xiong, J.; Liu, X. Cooperative Trajectory Design of Multiple UAV Base Stations with Heterogeneous Graph Neural Networks. IEEE Trans. Wirel. Commun. 2023, 22, 1495–1509. [Google Scholar] [CrossRef]
  3. Mohsan, S.A.H.; Othman, N.Q.H.; Li, Y.; Alsharif, M.H.; Khan, M.A. Unmanned aerial vehicles (UAVs): Practical aspects, applications, open challenges, security issues, and future trends. Intell. Serv. Robot. 2023, 16, 109–137. [Google Scholar] [CrossRef] [PubMed]
  4. Bayomi, N.; Fernandez, J.E. Eyes in the sky: Drones applications in the built environment under climate change challenges. Drones 2023, 7, 637. [Google Scholar] [CrossRef]
  5. Gu, X.; Zhang, G. A survey on UAV-assisted wireless communications: Recent advances and future trends. Comput. Commun. 2023, 208, 44–78. [Google Scholar] [CrossRef]
  6. Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147. [Google Scholar] [CrossRef]
  7. Guo, S.; Zhang, X.; Zheng, Y.; Du, Y. An autonomous path planning model for unmanned ships based on deep reinforcement learning. Sensors 2020, 20, 426. [Google Scholar] [CrossRef]
  8. Azar, A.T.; Koubaa, A.; Ali, M.N.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A. Drone deep reinforcement learning: A review. Electronics 2021, 10, 999. [Google Scholar] [CrossRef]
  9. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; PMLR: London, UK, 2014; pp. 387–395. [Google Scholar]
  10. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  11. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  12. Konda, V.; Tsitsiklis, J. Actor-critic algorithms. Adv. Neural Inf. Process. Syst. 1999, 12, 1008–1014. [Google Scholar]
  13. Li, B.; Wu, Y. Path planning for UAV ground target tracking via deep reinforcement learning. IEEE Access 2020, 8, 29064–29074. [Google Scholar] [CrossRef]
  14. Khuntia, P.; Hazra, R. An actor-critic reinforcement learning for device-to-device communication underlaying cellular network. In Proceedings of the TENCON 2018—2018 IEEE Region 10 Conference, Jeju, Republic of Korea, 28–31 October 2018; IEEE: Piscataway, NJ, USA, 2018; p. 0050. [Google Scholar]
  15. Lillicrap, T. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  16. Na, Z.; Wang, J.; Liu, C.; Guan, M.; Gao, Z. Join trajectory optimization and communication design for UAV-enabled OFDM networks. Ad Hoc Netw. 2020, 98, 102031. [Google Scholar] [CrossRef]
  17. Edfors, O.; Sandell, M.; Van De Beek, J.-J.; Landström, D.; Sjöberg, F. An Introduction to Orthogonal Frequency-Division Multiplexing. Ph.D. Thesis, Lund University, Lund, Sweden, 1996. [Google Scholar]
  18. Zeng, Y.; Zhang, R. Energy-efficient UAV communication with trajectory optimization. IEEE Trans. Wirel. Commun. 2017, 16, 3747–3760. [Google Scholar] [CrossRef]
  19. Lee, J.-H.; Park, K.-H.; Ko, Y.-C.; Alouini, M.-S. A UAV-mounted free space optical communication: Trajectory optimization for flight time. IEEE Trans. Wirel. Commun. 2019, 19, 1610–1621. [Google Scholar] [CrossRef]
  20. Liu, C.H.; Chen, Z.; Tang, J.; Xu, J.; Piao, C. Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach. IEEE J. Sel. Areas Commun. 2018, 36, 2059–2070. [Google Scholar] [CrossRef]
  21. Yin, S.; Zhao, S.; Zhao, Y.; Yu, F.R. Intelligent trajectory design in UAV-aided communications with reinforcement learning. IEEE Trans. Veh. Technol. 2019, 68, 8227–8231. [Google Scholar] [CrossRef]
  22. Liu, Q.; Shi, L.; Sun, L.; Li, J.; Ding, M.; Shu, F. Path planning for UAV-mounted mobile edge computing with deep reinforcement learning. IEEE Trans. Veh. Technol. 2020, 69, 5723–5728. [Google Scholar] [CrossRef]
  23. Xiong, Z.; Zhang, Y.; Lim, W.Y.B.; Kang, J.; Niyato, D.; Leung, C.; Miao, C. UAV-assisted wireless energy and data transfer with deep reinforcement learning. IEEE Trans. Cogn. Commun. Netw. 2020, 7, 85–99. [Google Scholar] [CrossRef]
  24. Yuan, X.; Hu, S.; Ni, W.; Wang, X.; Jamalipour, A. Deep Reinforcement Learning-Driven Reconfigurable Intelligent Surface-Assisted Radio Surveillance with a Fixed-Wing UAV. IEEE Trans. Inf. Forensics Secur. 2023, 18, 4546–4560. [Google Scholar] [CrossRef]
  25. Zhu, X.; Lin, L.; Huang, Y.; Wang, X.; Que, Y.; Jedari, B.; Piran, M.J. Secure Data Transmission Based on Reinforcement Learning and Position Confusion for Internet of UAVs. IEEE Internet Things J. 2024, 11, 21010–21020. [Google Scholar] [CrossRef]
  26. Peng, Y.; Song, T.; Song, X.; Yang, Y.; Lu, W. Time-Effective UAV-IRS-Collaborative Data Harvesting: A Robust Deep Reinforcement Learning Approach. IEEE Trans. Wirel. Commun. 2024, 23, 18592–18607. [Google Scholar] [CrossRef]
  27. Wang, Y.; Gao, Z.; Zhang, J.; Cao, X.; Zheng, D.; Gao, Y.; Ng, D.W.K.; Di Renzo, M. Trajectory design for UAV-based Internet of Things data collection: A deep reinforcement learning approach. IEEE Internet Things J. 2021, 9, 3899–3912. [Google Scholar] [CrossRef]
  28. Dong, R.; Wang, B.; Cao, K.; Tian, J.; Cheng, T. Secure Transmission Design of RIS Enabled UAV Communication Networks Exploiting Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2024, 73, 8404–8419. [Google Scholar] [CrossRef]
  29. Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P.L.; Sutskever, I.; Abbeel, P. RL²: Fast Reinforcement Learning via Slow Reinforcement Learning. arXiv 2016, arXiv:1611.02779. [Google Scholar]
  30. Bharadiya, J.P. Exploring the Use of Recurrent Neural Networks for Time Series Forecasting. Int. J. Innov. Sci. Res. Technol. 2023, 8, 2023–2027. [Google Scholar]
  31. Iparraguirre-Villanueva, O.; Guevara-Ponce, V.; Ruiz-Alvarado, D.; Beltozar-Clemente, S.; Sierra-Liñan, F.; Zapata-Paulini, J.; Cabanillas-Carbonell, M. Text Prediction Recurrent Neural Networks Using Long Short-Term Memory-Dropout. arXiv 2023, arXiv:2302.01459. [Google Scholar] [CrossRef]
  32. Donkol, A.A.E.-B.; Hafez, A.G.; Hussein, A.I.; Mabrook, M.M. Optimization of intrusion detection using likely point PSO and enhanced LSTM-RNN hybrid technique in communication networks. IEEE Access 2023, 11, 9469–9482. [Google Scholar] [CrossRef]
  33. Oyewola, D.O.; Akinwunmi, S.A.; Omotehinwa, T.O. Deep LSTM and LSTM-Attention Q-learning based reinforcement learning in oil and gas sector prediction. Knowl.-Based Syst. 2024, 284, 111290. [Google Scholar] [CrossRef]
  34. Zhou, L.; Chen, X.; Hong, M.; Jin, S.; Shi, Q. Efficient resource allocation for multi-UAV communication against adjacent and co-channel interference. IEEE Trans. Veh. Technol. 2021, 70, 10222–10235. [Google Scholar] [CrossRef]
  35. Rappaport, T.S. Wireless Communications: Principles and Practice; Cambridge University Press: Cambridge, UK, 2024. [Google Scholar]
  36. Al-Hourani, A.; Yanikomeroglu, H. Line-of-Sight Probability and Holding Distance in Non-Terrestrial Networks. IEEE Commun. Lett. 2024, 28, 622–626. [Google Scholar] [CrossRef]
  37. 3GPP. User Equipment (UE) Radio Transmission and Reception, 3rd Generation Partnership Project (3GPP), Technical Specification, 38.101, v 16.3.0. 2020. Available online: https://www.3gpp.org/ (accessed on 16 December 2024).
  38. Haider, A.; Hwang, S.-H. Maximum Transmit Power for UE in an LTE Small Cell Uplink. Electronics 2019, 8, 796. [Google Scholar] [CrossRef]
  39. Wang, L.; Wang, K.; Pan, C.; Xu, W.; Aslam, N.; Nallanathan, A. Deep Reinforcement Learning Based Dynamic Trajectory Control for UAV-Assisted Mobile Edge Computing. IEEE Trans. Mobile Comput. 2022, 21, 3536–3550. [Google Scholar] [CrossRef]
  40. Yuan, Y.; Lei, L.; Vu, T.X.; Chatzinotas, S.; Sun, S.; Ottersten, B. Actor-Critic Learning-Based Energy Optimization for UAV Access and Backhaul Networks. J. Wirel. Commun. Netw. 2021, 2021, 78. [Google Scholar] [CrossRef]
  41. Yu, Z.; Chen, Y. Persistent Monitoring UAV Path Planning Based on Entropy Optimization. In Proceedings of the 2023 IEEE 13th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Qinhuangdao, China, 11–14 July 2023; pp. 909–914. [Google Scholar]
  42. Xiao, S.; Tan, X.; Wang, J. A Simulated Annealing Algorithm and Grid Map-Based UAV Coverage Path Planning Method for 3D Reconstruction. Electronics 2021, 10, 853. [Google Scholar] [CrossRef]
Figure 1. UAV communication sketch.
Figure 2. DRL model for UAV messaging.
Figure 3. Illustration of UAV path planning and IoT data collection. (A) Starting position; (B) the UAV only connects with IoT device 1; (C) the UAV connects with both IoT devices; (D) the UAV only connects with IoT device 2.
Figure 4. Value loss functions at different learning rates.
Figure 5. Reward function.
Figure 6. Comparison of data collection under different numbers of channels and IoT devices.
Figure 7. Flight paths of the UAV with different algorithms and IoT device locations, with $\beta = 10^{-3}$ and $\lambda = 10^{-9}$. (A) IoT devices are distributed along the diagonal of the region Ω; (B) IoT devices are distributed in an "S" shape; (C) IoT devices are concentrated on the left side of the diagonal of the region Ω; (D) IoT devices are concentrated on the right side of the diagonal of the region Ω.
Figure 8. Amount of information collected by the UAV with different algorithms and IoT device locations. (A) IoT devices are distributed along the diagonal of the region Ω; (B) IoT devices are distributed in an "S" shape; (C) IoT devices are concentrated on the left side of the diagonal of the region Ω; (D) IoT devices are concentrated on the right side of the diagonal of the region Ω.
Table 1. Parameter settings [18].

Parameter     Value
b_1           10
b_2           0.6
α             2.3
ρ             0.2
σ²            −170 dBm/Hz
c_1           9.26 × 10⁻⁴ kg/m
c_2           2250 kg·m³/s⁴
ω             312.5 kHz
p_uplink      20 dBm
m_uav         10 kg
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
