CN113359480B

CN113359480B - Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm

Info

Publication number: CN113359480B
Application number: CN202110806485.3A
Authority: CN
Inventors: 赵建伟; 吴官翰; 贾维敏; 张峰干; 姜楠; 王连锋; 谭力宁; 金伟; 金国栋; 沈涛; 张聪; 何芳
Original assignee: Rocket Force University of Engineering of PLA
Current assignee: Rocket Force University of Engineering of PLA
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2022-02-01
Anticipated expiration: 2041-07-16
Also published as: CN113359480A

Abstract

The invention discloses a method, based on the MAPPO algorithm, for optimizing cooperative communication between multiple unmanned aerial vehicles and users, comprising the following steps: step one, establishing an unmanned aerial vehicle network model and a user network model; step two, setting the unmanned aerial vehicle and user scene; step three, acquiring the observation states of the unmanned aerial vehicles and the users; step four, acquiring the global states of the unmanned aerial vehicles and the users; step five, obtaining the rewards of the unmanned aerial vehicles and the users; step six, storing experience tuples; step seven, iteratively optimizing the parameters of the network models with the MAPPO algorithm; and step eight, optimizing and predicting the communication between multiple unmanned aerial vehicles and multiple users. By optimizing the parameters of the unmanned aerial vehicle and user network models, the invention optimizes the flight azimuth angle, power allocation and bandwidth allocation of the unmanned aerial vehicles, adapts to the observation states of multiple unmanned aerial vehicles and multiple users to predict and output a reasonable cooperative communication optimization strategy, and maximizes the throughput of the communication system under multidimensional decision-making while satisfying the fairness of resource allocation.

Description

Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm

Technical Field

The invention belongs to the technical field of unmanned aerial vehicles and user communication, and particularly relates to a multi-unmanned aerial vehicle and user cooperative communication optimization method based on a MAPPO algorithm.

Background

In current 5G mobile communication, the terrestrial backbone network bears enormous data transmission pressure as emerging industries develop rapidly. Meanwhile, limited by geographical conditions, many remote areas still lack adequate wireless coverage. These unprecedented demands for high-quality wireless communication services pose significant challenges to traditional terrestrial communication networks. For this reason, using Unmanned Aerial Vehicles (UAVs) as aerial access nodes to assist ground communication has become a promising solution for future 6G and beyond wireless communication.

As a flying base station, an unmanned aerial vehicle offers greater flexibility and more degrees of freedom, and can cross varied terrain to provide wireless coverage for users. On the one hand, it can offload part of the computing load that overflows from the ground, relieving the computing and transmission pressure on ground base stations; on the other hand, it can flexibly adjust its ground coverage area to follow randomly moving Ground Users (GUs). Meanwhile, because the air-to-ground link of the unmanned aerial vehicle has good line-of-sight characteristics, the probability of non-line-of-sight blockage and shadowing is greatly reduced, which lowers unnecessary path loss to a certain extent and, given the limited energy of the unmanned aerial vehicle and an equal Quality of Service (QoS), extends its working time.

Existing work mainly optimizes unmanned aerial vehicle trajectories under a fixed communication resource allocation or with only a single communication resource being allocated. The optimization goals are limited to either the unmanned aerial vehicle side or ground access control alone, and the problem has not been studied jointly at the level of multiple unmanned aerial vehicles and multiple users.

Therefore, what is currently lacking is a method for optimizing cooperative communication between multiple unmanned aerial vehicles and users based on the MAPPO algorithm: one that optimizes the flight azimuth angle, power allocation and bandwidth allocation of the unmanned aerial vehicles by optimizing the parameters of the unmanned aerial vehicle and user network models, adapts to the observation states of multiple unmanned aerial vehicles and multiple users to predict and output a reasonable cooperative communication optimization strategy, and maximizes the throughput of the communication system under multidimensional decision-making while satisfying the fairness of resource allocation.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a cooperative communication optimization method of multiple unmanned aerial vehicles and users based on MAPPO algorithm, which is simple in steps and reasonable in design, realizes the optimization of flight azimuth angle, power and bandwidth distribution of the unmanned aerial vehicles through the optimization of the unmanned aerial vehicles and user network model parameters, effectively adapts to the observation states of multiple unmanned aerial vehicles and multiple users to predict and output a reasonable cooperative communication optimization strategy, and realizes the maximization of the throughput of a communication system under the action of multidimensional decision and meets the fairness of resource distribution.

In order to solve the technical problems, the invention adopts the technical scheme that: a multi-unmanned aerial vehicle and user cooperative communication optimization method based on a MAPPO algorithm is characterized by comprising the following steps:

step one, establishing an unmanned aerial vehicle network model and a user network model:

step 101, setting the parameter of the unmanned aerial vehicle Actor network as φ, the parameter of the unmanned aerial vehicle Critic network as ω₁, the parameter of the user Actor network as θ, and the parameter of the user Critic network as ω₂;

Step 102, setting the initial value of the parameter φ of the unmanned aerial vehicle Actor network to φ(0), the initial value of the parameter ω₁ of the unmanned aerial vehicle Critic network to ω₁(0), the initial value of the parameter θ of the user Actor network to θ(0), and the initial value of the parameter ω₂ of the user Critic network to ω₂(0); wherein φ(0), ω₁(0), θ(0) and ω₂(0) all satisfy the orthogonal initialization of the neural network;
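As an illustration of the orthogonal initialization named in step 102, the following is a minimal sketch in plain Python. It builds a weight matrix with orthonormal rows (or columns) by Gram-Schmidt orthogonalization of Gaussian draws. The function name `orthogonal_init`, the `gain` argument and the use of Gram-Schmidt are assumptions made for this example; the patent does not state how the orthogonal weights are generated.

```python
import math
import random


def orthogonal_init(rows, cols, gain=1.0, seed=None):
    """Return a rows x cols matrix (list of lists) whose rows are orthonormal
    when rows <= cols, or whose columns are orthonormal otherwise.

    Hypothetical helper for illustration only; `gain` scales the result.
    """
    rng = random.Random(seed)
    transpose = rows > cols
    # Build n orthonormal vectors of dimension d, with n <= d.
    n, d = (cols, rows) if transpose else (rows, cols)
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Gram-Schmidt: remove the components along vectors already accepted.
        for u in basis:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip numerically degenerate draws
            basis.append([a / norm for a in v])
    if transpose:
        # basis holds the columns; convert to a row-major rows x cols matrix.
        matrix = [[basis[j][i] for j in range(n)] for i in range(d)]
    else:
        matrix = basis
    return [[gain * x for x in row] for row in matrix]
```

In practice each layer of the four networks (φ, ω₁, θ, ω₂) would be initialized this way, typically through the initializer provided by the deep-learning framework in use.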

step two, setting unmanned aerial vehicles and user scenes:

step 201, establishing a two-dimensional rectangular coordinate system OXY; wherein, the two-dimensional rectangular coordinate system is superposed with the ground area D;

step 202, setting N users in the ground area D, wherein the set of users is

Wherein, the position coordinate of the nth user at the tth moment is

n and N are positive integers with 1 ≤ n ≤ N, the ground area D is located in the first quadrant of OXY, the origin O coincides with the lower left corner of the ground area D, and t is a positive integer;

step 203, setting M unmanned aerial vehicles above the ground area D, wherein the set of unmanned aerial vehicles is

And is

The deployment heights of the M unmanned aerial vehicles relative to the ground area D are all h;

step three, acquiring the observation states of the unmanned aerial vehicle and the user:

step 301, setting the observation state of the nth user at the tth moment as

And is

Wherein,

indicating the coordinate position of the nth user at the time instant t,

denotes the two-dimensional coordinate position under OXY of the mth unmanned aerial vehicle that the nth user can access at the tth moment, where m and M are positive integers and 1 ≤ m ≤ M; s_m(t-j) denotes the number of users served by the mth unmanned aerial vehicle at the jth moment before the tth moment, where j is a positive integer and j = 1, …, w; w is a positive integer and w < t;

step 302, the observation state of the nth user at the tth moment

is input into the user Actor network with initial parameter value θ(0), and the user Actor network outputs the preactivation component x_m(θ(0)) for the mth unmanned aerial vehicle;

Step 303, using a computer to

Obtaining discrete probability distribution of action of the nth user selecting the mth unmanned aerial vehicle at the tth moment

Wherein exp (·) represents an exponential function with a natural constant e as the base,

representing the action of selecting the unmanned aerial vehicle by the nth user at the tth moment;

step 304, the nth user at the tth moment according to the discrete probability distribution

Sampling action

And selecting corresponding unmanned aerial vehicle for access, and acquiring action of selecting unmanned aerial vehicle by nth user at the tth moment

Probability of (2)

step 305, using a computer to set, according to the users' selections and the state of the unmanned aerial vehicles, the observation state of the mth unmanned aerial vehicle at the tth moment as

And is

Wherein,

the two-dimensional coordinate position of the mth unmanned aerial vehicle under the OXY at the tth moment is shown,

indicating the coordinate positions of other unmanned aerial vehicles under OXY after the mth unmanned aerial vehicle is removed at the tth moment, wherein m 'is a positive integer, m' ≠ m, and

σ_m,n(t) represents the status of the nth user accessing the mth drone;

step 306, using a computer to input the observation state of the mth unmanned aerial vehicle at the tth moment

into the unmanned aerial vehicle Actor network with initial parameter value φ(0); given this observation state of the mth unmanned aerial vehicle at the tth moment

the unmanned aerial vehicle Actor network outputs, for the action of the mth unmanned aerial vehicle at the tth moment

the probability distribution

Wherein,

obeying a beta distribution, i.e.

α_φ and β_φ are both shape parameters of the beta distribution;

the action of the mth unmanned aerial vehicle at the tth moment is shown;

according to

Sampling action

Obtaining the transmitting power output value of the mth unmanned aerial vehicle to the nth user at the tth moment

Bandwidth output value of mth unmanned aerial vehicle to nth user at tth moment

And the flight azimuth angle of the mth unmanned aerial vehicle at the tth moment

And the motion of the mth unmanned aerial vehicle at the tth moment

Probability of (2)

Step 307, setting by computer

as the action mask of the mth unmanned aerial vehicle at the tth moment, and using the computer to let

And

wherein,

denotes the masked power value of the mth unmanned aerial vehicle for the nth user at the tth moment,

denotes the masked bandwidth value of the mth unmanned aerial vehicle for the nth user at the tth moment;

step 308, using a computer to

obtain the transmit power action component p_m,n(t) allocated by the mth unmanned aerial vehicle to the nth user at the tth moment;

By computer according to

obtain the bandwidth resource action component b_m,n(t) allocated by the mth unmanned aerial vehicle to the nth user at the tth moment; wherein b_m(t) denotes the bandwidth resources that the mth unmanned aerial vehicle can allocate at the tth moment, and

B_total denotes the total bandwidth resource shared by all unmanned aerial vehicles, s_m(t) denotes the total number of users accessing the mth unmanned aerial vehicle, and b_min denotes the minimum divisible bandwidth;

step 309, obtaining the motion of the mth unmanned aerial vehicle at the tth moment by using a computer

And is

Wherein,

representing the flight azimuth angle of the mth unmanned aerial vehicle at the tth moment;

step 30A, the observation state of the nth user at the tth moment is

And the observation state of the mth unmanned aerial vehicle at the tth moment is

Merging the observed states recorded as the ith agent at the tth moment

Wherein, the agent includes M unmanned aerial vehicle and N users, and i is positive integer, and

act of selecting unmanned aerial vehicle by nth user at tth moment

And the action of the mth unmanned aerial vehicle at the tth moment

Merging actions written as ith agent at tth moment

Act of selecting unmanned aerial vehicle by nth user at tth moment

Probability of (2)

And the action of the mth unmanned aerial vehicle at the tth moment

Probability of (2)

Merging action probabilities written as ith agent
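The user access decision in steps 302 to 304 can be sketched as follows: the preactivation components x_m output by the user Actor network are turned into a discrete probability distribution with the exponential function, an unmanned aerial vehicle index is sampled from it, and the probability of the sampled action is kept for the experience tuple. This is a minimal sketch assuming the distribution is the standard softmax over the M preactivations; the function names are hypothetical.

```python
import math
import random


def softmax(preactivations):
    """Map preactivation components x_1..x_M to a discrete distribution."""
    shift = max(preactivations)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in preactivations]
    total = sum(exps)
    return [e / total for e in exps]


def sample_access(preactivations, rng=random):
    """Sample the index of the selected unmanned aerial vehicle and return it
    together with the probability of that choice (0-based index)."""
    probs = softmax(preactivations)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs[index]


if __name__ == "__main__":
    # Three candidate unmanned aerial vehicles with illustrative preactivations.
    chosen, prob = sample_access([0.2, 1.5, -0.3], random.Random(42))
    print(chosen, round(prob, 3))
```

Returning the probability alongside the action matches the way the method later stores action probabilities in the experience tuple of each agent.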

Step four, acquiring global states of the unmanned aerial vehicle and the user:

step 401, using a computer to input p_m,n(t) and b_m,n(t) from step 309 into the Shannon channel capacity formula, obtaining the theoretical communication speed c_m,n(t) provided by the mth unmanned aerial vehicle for the nth user at the tth moment;

Step 402, using a computer according to

Obtaining the communication speed of the nth user at the t moment

Step 403, setting the global state of the mth unmanned aerial vehicle at the tth moment to be

And is

Step 404, setting global state of nth user at tth moment as

Wherein,

indicating the coordinate positions of other users under OXY after the nth user is removed at the tth moment, n 'is a positive integer, n' ≠ n, and

step 405, carrying out global state of the mth unmanned aerial vehicle at the tth moment

And global state of nth user at tth moment

Merging global states recorded as ith agent at tth moment

Wherein i is a positive integer, and

step five, obtaining the rewards of the unmanned aerial vehicle and the user:

step 501, adopt the computer to according to

obtain the average communication speed c_mean(t) of the N users at the tth moment;

Step 502, using a computer according to

obtain the fairness index f_m(t) of the mth unmanned aerial vehicle at the tth moment;

Step 503, using computer to

Obtain reward of mth unmanned aerial vehicle at tth moment

wherein r_d denotes the reward factor of the unmanned aerial vehicle, κ_r is the index parameter of f_m(t),

the boundary penalty item of the mth unmanned aerial vehicle at the tth moment is represented;

step 504, using a computer to

Receive the reward of the nth user at the tth moment

wherein r_c denotes the reward factor of the user;

step 505, using the computer to take the reward of the nth user at the tth moment

and the reward of the mth unmanned aerial vehicle at the tth moment

and merge them, recorded as the reward of the ith agent at the tth moment
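The quantities used in step five can be illustrated with a small sketch. The formulas for c_mean(t) and f_m(t) did not survive in this text, so the code below is an assumption: it takes c_mean(t) to be the arithmetic mean of the user rates and uses Jain's fairness index as a stand-in for the fairness index, which is a common choice but is not confirmed by the source. The reward expressions themselves are not reproduced here.

```python
def mean_rate(rates):
    """Arithmetic mean of the per-user communication speeds."""
    return sum(rates) / len(rates)


def jain_fairness(rates):
    """Jain's fairness index: (sum c)^2 / (N * sum c^2).

    Assumed stand-in for the patent's fairness index f_m(t). It equals 1 when
    all users receive the same rate and 1/N when a single user receives
    everything.
    """
    total = sum(rates)
    squares = sum(r * r for r in rates)
    if squares == 0.0:
        return 0.0  # no user is served; treat fairness as zero
    return total * total / (len(rates) * squares)


if __name__ == "__main__":
    rates = [2.0e6, 1.5e6, 0.5e6]  # illustrative rates in bit/s
    print(mean_rate(rates), jain_fairness(rates))
```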

Step six, storing experience tuples:

step 601, adopting a computer to send

The experience tuple is taken as the experience tuple of the ith agent at the tth moment and is stored in a cache region;

step 602, repeating step three to step 601 to obtain the experience tuple of the next moment and store it in the buffer, until t = T_max, at which point the data storage of one round is complete; wherein T_max denotes the total number of moments in each round;

step 603, repeating step 602 to store the data of the next round, until the number of experience tuples in the buffer reaches B, thereby obtaining the first round of training data; wherein B is greater than T_max;
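The collection loop of steps 601 to 603 can be sketched as below: whole rounds of T_max moments are appended to the buffer until it holds at least B experience tuples. The callback `step_fn` is a hypothetical stand-in for steps three to five (it should return one experience tuple), and the sketch assumes B is a multiple of T_max; otherwise it stops at the first round boundary at or above B.

```python
def collect_training_batch(step_fn, t_max, batch_size):
    """Fill an experience buffer round by round.

    step_fn(round_index, t) -> experience tuple (hypothetical callback).
    Returns the buffer and the number of rounds that were stored.
    """
    buffer = []
    rounds = 0
    while len(buffer) < batch_size:
        # One round: moments t = 1 .. T_max.
        for t in range(1, t_max + 1):
            buffer.append(step_fn(rounds, t))
        rounds += 1
    return buffer, rounds


if __name__ == "__main__":
    # Dummy tuple layout: (observation, action, action probability, reward, state).
    dummy = lambda rnd, t: ((rnd, t), 0, 1.0, 0.0, (rnd, t))
    data, n_rounds = collect_training_batch(dummy, t_max=4, batch_size=12)
    print(len(data), n_rounds)
```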

Step seven, parameters of the MAPPO algorithm iterative optimization network model:

step 701, inputting the first round of training data, and using a computer to perform gradient ascent optimization on the parameter φ of the unmanned aerial vehicle Actor network and the parameter θ of the user Actor network with the MAPPO algorithm, obtaining the first-round optimized value of the parameter φ of the unmanned aerial vehicle Actor network and the first-round optimized value of the parameter θ of the user Actor network;

meanwhile, using the computer to perform gradient descent optimization with the MAPPO algorithm on the parameter ω₁ of the unmanned aerial vehicle Critic network and the parameter ω₂ of the user Critic network, obtaining the first-round optimized value of the parameter ω₁ of the unmanned aerial vehicle Critic network and the first-round optimized value of the parameter ω₂ of the user Critic network;

step 702, obtaining next round of training data according to the method from the third step to the step 603;

step 703, inputting the next round of training data and, according to the method in step 701, performing the next round of optimization updating with the previous round's optimized values as the initial parameter values, obtaining the next-round optimized value of the parameter φ of the unmanned aerial vehicle Actor network, the next-round optimized value of the parameter θ of the user Actor network, the next-round optimized value of the parameter ω₁ of the unmanned aerial vehicle Critic network, and the next-round optimized value of the parameter ω₂ of the user Critic network;

step 704, according to the method from step three to step 603, completing the data storage for the set maximum number of rounds T_h, obtaining the P-th round of training data; wherein P is a positive integer;

step 705, inputting the P-th round of training data and, according to the method in step 701, using the previous round's optimized values as the initial parameter values to obtain the P-th round optimized value of the parameter φ of the unmanned aerial vehicle Actor network, the P-th round optimized value of the parameter θ of the user Actor network, the P-th round optimized value of the parameter ω₁ of the unmanned aerial vehicle Critic network, and the P-th round optimized value of the parameter ω₂ of the user Critic network;
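Step seven describes gradient ascent on the Actor parameters and gradient descent on the Critic parameters. As a hedged illustration, the sketch below evaluates the two objectives that PPO-style algorithms such as MAPPO commonly use: the clipped surrogate objective for the Actor, built from the stored action probabilities, and a mean squared error loss for the Critic. The patent text does not give these expressions, so the clipping form, the default `clip_eps` of 0.2 and the advantage inputs are assumptions.

```python
def clipped_surrogate(new_probs, old_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, averaged over samples.

    new_probs: action probabilities under the current Actor parameters.
    old_probs: action probabilities stored in the experience tuples.
    The Actor is updated by gradient ascent on this value.
    """
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def value_loss(predicted_values, returns):
    """Mean squared error between Critic outputs and return targets.

    The Critic is updated by gradient descent on this value.
    """
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, returns)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    print(clipped_surrogate([0.6, 0.2], [0.4, 0.4], [1.0, -1.0]))
    print(value_loss([1.0, 3.0], [2.0, 1.0]))
```

The clipping keeps each update close to the policy that generated the data, which is why the action probabilities are stored with every experience tuple.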

step eight, optimizing and predicting the cooperative communication of multiple unmanned aerial vehicles and multiple users:

step 801, obtaining the optimized network model according to the P-th round optimized value of the parameter φ of the unmanned aerial vehicle Actor network, the P-th round optimized value of the parameter θ of the user Actor network, the P-th round optimized value of the parameter ω₁ of the unmanned aerial vehicle Critic network, and the P-th round optimized value of the parameter ω₂ of the user Critic network;

step 802, acquiring the observation state of the nth user and the observation state of the mth unmanned aerial vehicle at a subsequent moment, and inputting them into the optimized network model to obtain the cooperative communication optimization action strategy of the mth unmanned aerial vehicle and the nth user at the subsequent moment.

The MAPPO algorithm-based multi-unmanned aerial vehicle and user cooperative communication optimization method is characterized in that: in step 401, a computer inputs p_m,n(t) and b_m,n(t) from step 309 into the Shannon channel capacity formula to obtain the theoretical communication speed c_m,n(t) provided by the mth unmanned aerial vehicle for the nth user at the tth moment, and the specific process is as follows:

step 4011, using computer according to formula

Obtaining LoS link probability from the mth unmanned aerial vehicle to the nth user at the tth moment

Wherein a denotes a first constant relating to the environment, b denotes a second constant relating to the environment, d_m,n(t) represents the linear distance from the mth unmanned aerial vehicle to the nth user at the tth moment;

step 4012, using computer according to formula

Obtaining the path loss from the mth unmanned aerial vehicle to the nth user at the tth moment under the LoS link

wherein ξ_LoS denotes the additional loss under the LoS link, c denotes the speed of light, and f_c denotes the signal carrier frequency;

step 4013, adopting computer to calculate according to formula

Obtaining the path loss from the mth unmanned aerial vehicle to the nth user at the tth moment under the NLoS link

wherein ξ_NLoS denotes the additional loss under the NLoS link;

step 4014, using computer according to formula

obtaining the path loss PL_m,n(t) of the signal from the mth unmanned aerial vehicle to the nth user; wherein,

the probability of NLoS link from the mth unmanned aerial vehicle to the nth user at the tth moment is represented, and

step 4015, using computer according to formula

obtaining the signal power received by the nth user from the mth unmanned aerial vehicle at the tth moment

Step 4016, using computer according to formula

obtaining the theoretical communication speed c_m,n(t) provided by the mth unmanned aerial vehicle for the nth user at the tth moment; wherein n₀ denotes the power spectral density of the Gaussian white noise in the channel.
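Steps 4011 to 4016 chain a LoS probability, LoS and NLoS path losses, a probability-weighted average path loss, a received power and a Shannon rate. The exact expressions are not legible in this text, so the following sketch assumes the widely used probabilistic air-to-ground model: a sigmoid LoS probability in the elevation angle with environment constants a and b, free-space path loss plus the additional losses ξ_LoS and ξ_NLoS in dB, and a Shannon rate with noise power b_m,n(t)·n₀. All function names and the numerical values in the usage example are illustrative assumptions, chosen within the ranges stated in the text.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def los_probability(height, distance, a, b):
    """Assumed sigmoid LoS probability; elevation angle in degrees."""
    elevation_deg = math.degrees(math.asin(height / distance))
    return 1.0 / (1.0 + a * math.exp(-b * (elevation_deg - a)))


def free_space_loss_db(distance, carrier_hz):
    """Free-space path loss in dB for a link of the given length."""
    return 20.0 * math.log10(4.0 * math.pi * carrier_hz * distance / SPEED_OF_LIGHT)


def rate_bps(power_w, bandwidth_hz, height, distance, carrier_hz,
             a, b, xi_los_db, xi_nlos_db, n0):
    """Theoretical rate c_m,n(t) in bit/s under the assumed channel model."""
    if bandwidth_hz <= 0.0:
        return 0.0  # a user that receives no bandwidth gets no rate
    p_los = los_probability(height, distance, a, b)
    fspl = free_space_loss_db(distance, carrier_hz)
    # Average path loss weighted by the LoS and NLoS probabilities (dB).
    pl_db = p_los * (fspl + xi_los_db) + (1.0 - p_los) * (fspl + xi_nlos_db)
    received_w = power_w * 10.0 ** (-pl_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + received_w / (bandwidth_hz * n0))


if __name__ == "__main__":
    # Illustrative values: 0.1 W, 1 MHz, h = 100 m, d = 300 m, 2 GHz carrier.
    print(rate_bps(0.1, 1e6, 100.0, 300.0, 2e9, 9.61, 0.16, 1.0, 20.0, 4e-21))
```

Here `distance` is the straight-line distance d_m,n(t), which is never smaller than the deployment height h, so the arcsine argument stays within its domain.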

The MAPPO algorithm-based multi-unmanned aerial vehicle and user cooperative communication optimization method is characterized by comprising the following steps: in the step 4011, a is more than 4.88 and less than 28, and b is more than 0 and less than 1;

the additional loss ξ_NLoS under the NLoS link in step 4012 and step 4013 is greater than the additional loss ξ_LoS under the LoS link; the value range of the additional loss ξ_LoS under the LoS link is (0 dB, 50 dB), and the value range of the additional loss ξ_NLoS under the NLoS link is (10 dB, 100 dB);

the value range of the user reward factor r_c in step 504 is 1 to 3;

the value range of the reward factor r_d of the unmanned aerial vehicle in step 503 is 1 to 5, and r_d is greater than r_c; the index parameter κ_r is a positive integer from 1 to 5.

The MAPPO algorithm-based multi-unmanned aerial vehicle and user cooperative communication optimization method is characterized by comprising the following steps: boundary penalty item of mth unmanned aerial vehicle at tth moment in step 503

The specific process of obtaining is as follows:

step 5031, setting the upper bound of the ground area D on the X axis as u_max,x, the upper bound of the ground area D on the Y axis as u_max,y, the lower bound of the ground area D on the X axis as u_min,x, and the lower bound of the ground area D on the Y axis as u_min,y, with u_min,x = u_min,y = 0;

Step 5032, adopting a computer to determine the position of the mth unmanned aerial vehicle at the tth moment

Obtaining the X coordinate of the mth unmanned aerial vehicle at the tth moment

And the Y coordinate of the mth unmanned aerial vehicle at the tth moment

Step 5033, when

is greater than u_max,x or

is less than u_min,x, using the computer according to

to obtain the boundary penalty item of the mth unmanned aerial vehicle at the tth moment

wherein r_b denotes the penalty factor and κ_b denotes the gradient factor that determines the smoothness of the boundary function; the value range of the penalty factor r_b is 10 to 50, and the value range of the gradient factor κ_b is 0.07 to 0.1;

when

is greater than u_max,y or

is less than u_min,y, using the computer according to

when

is greater than u_max,x and

is greater than u_max,y, or

is less than u_min,x and

is less than u_min,y, using the computer according to

when

and

are both located in the ground area D,

the MAPPO algorithm-based multi-unmanned aerial vehicle and user cooperative communication optimization method is characterized by comprising the following steps: in the step 301, the value range of w is 3-20;

in step 306, α_φ and β_φ satisfy: α_φ ≥ 1, β_φ ≥ 1.
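The beta policy of step 306 with the constraint α_φ ≥ 1, β_φ ≥ 1 can be sketched as follows. The sketch assumes the Actor network emits two unconstrained preactivations that are mapped to shape parameters through 1 + softplus, and that a sample on [0, 1] is scaled to a flight azimuth angle on [0, 2π]; both the parameterization and the scaling are assumptions made for illustration, not details given in the text.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def beta_shape(preactivation):
    """Map an unconstrained network output to a shape parameter >= 1."""
    return 1.0 + softplus(preactivation)


def sample_azimuth(alpha, beta, rng=random):
    """Draw u ~ Beta(alpha, beta) on [0, 1] and scale it to [0, 2*pi]."""
    return 2.0 * math.pi * rng.betavariate(alpha, beta)


def beta_pdf(u, alpha, beta):
    """Density of Beta(alpha, beta) at u, for 0 < u < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1.0) * math.log(u)
                    + (beta - 1.0) * math.log(1.0 - u))


if __name__ == "__main__":
    a_shape, b_shape = beta_shape(0.3), beta_shape(-1.2)
    print(a_shape, b_shape, sample_azimuth(a_shape, b_shape, random.Random(7)))
```

Keeping both shape parameters at or above 1 makes the density unimodal (or uniform when both equal 1), so the policy never concentrates its mass at both ends of the bounded action range at once.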

The MAPPO algorithm-based multi-unmanned aerial vehicle and user cooperative communication optimization method is characterized in that: the value range of the set maximum number of rounds T_h in step 704 is 5000 to 6000;

total number of rounds

Compared with the prior art, the invention has the following advantages:

1. the method has simple steps and reasonable design, is suitable for the games of a plurality of unmanned aerial vehicles and a plurality of users, realizes the prediction of the cooperative communication optimization strategy, maximizes the throughput of a communication system under the action of multidimensional decision and meets the fairness of resource allocation.

2. The method comprises the steps of firstly establishing an unmanned aerial vehicle network model and a user network model, then obtaining training data through unmanned aerial vehicle and user scene setting, unmanned aerial vehicle and user observation state obtaining, unmanned aerial vehicle and user global state obtaining, unmanned aerial vehicle and user reward obtaining and experience tuple storing, and training the training data through MAPPO algorithm to realize updating and optimization of parameters of the network model to obtain an optimized network model; and finally, inputting the observation state of the user and the observation state of the unmanned aerial vehicle at the subsequent moment into the optimized network model so as to obtain the cooperative communication optimization strategy of the unmanned aerial vehicle and the user.

3. The invention uses the MAPPO algorithm to iteratively train the parameters of the unmanned aerial vehicle Actor network, the user Actor network, the unmanned aerial vehicle Critic network and the user Critic network, so that every user greedily maximizes its own communication rate through a competitive strategy, while each unmanned aerial vehicle intelligently allocates power and bandwidth resources to the users that choose to access it, dynamically decides its flight azimuth angle, and, through cooperation with the other unmanned aerial vehicles, forms the most appropriate spatial topology for the current environment.

4. The invention performs joint optimization on the access strategy of users, the power distributed by the unmanned aerial vehicles, the bandwidth resource scheduling distributed by the unmanned aerial vehicles and the flight azimuth angle of the unmanned aerial vehicles, and all the unmanned aerial vehicles share the total bandwidth resource, thereby maximizing the system throughput through dynamic resource scheduling and simultaneously ensuring the fairness of the communication rate among the users under the condition of meeting the constraint condition of the minimum communication rate of each user.

5. The invention adopts the MAPPO (Multi-Agent Proximal Policy Optimization) algorithm to solve the problem of discrete and continuous actions coexisting across different types of agents. Unlike earlier methods that decide the multidimensional actions of an unmanned aerial vehicle cluster centrally, the MAPPO algorithm accounts for the partial observability found in real conditions, so that each agent makes distributed decisions relying only on its own observation. This overcomes defects such as the excessive dimensionality and poor scalability caused by centralized decision-making when a single-agent reinforcement learning algorithm is applied to a multi-agent problem.

6. Aiming at the practical problem that different unmanned aerial vehicles can be selectively accessed by different numbers of users, the unmanned aerial vehicle resource allocation strategy dimensionality is dynamically adjusted by setting the action mask, and the user information which is not selectively accessed is shielded by the action mask, namely, the unmanned aerial vehicle only needs to allocate resources for the user which is selectively accessed.

7. Aiming at the fact that the flight azimuth angle of the unmanned aerial vehicle is bounded when the flight azimuth angle of the unmanned aerial vehicle is optimized, the parameterized beta strategy is adopted to replace the traditional Gaussian strategy, the problem of biased estimation of the Gaussian strategy under the condition that the action of the unmanned aerial vehicle is bounded can be solved, and the phenomenon that the unmanned aerial vehicle converges to local optimum under the multi-peak reward environment is improved.

8. The invention not only carries out strategy allocation on power, but also carries out strategy allocation on bandwidth, thereby improving the flexibility and latitude of allocation.
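The action mask described in advantage 6 (steps 307 and 308) can be illustrated with a short sketch: raw Actor outputs for users that did not select the unmanned aerial vehicle are zeroed by the access indicator σ_m,n(t), and the remaining outputs are rescaled to share a resource budget. The normalization used here is an assumption for illustration; the patent's own allocation formulas, including the handling of the minimum divisible bandwidth b_min, are not legible in this text and are not reproduced.

```python
def masked_allocation(raw_outputs, access_flags, budget):
    """Share `budget` among the users that accessed this unmanned aerial vehicle.

    raw_outputs:  non-negative Actor outputs, one per user.
    access_flags: 1 if the user selected this vehicle, else 0 (the mask).
    budget:       total power or bandwidth available to the vehicle.
    """
    masked = [out * flag for out, flag in zip(raw_outputs, access_flags)]
    total = sum(masked)
    if total <= 0.0:
        return [0.0] * len(masked)  # nobody to serve, allocate nothing
    return [budget * value / total for value in masked]


if __name__ == "__main__":
    # Users 0 and 2 selected this vehicle; user 1 is masked out.
    print(masked_allocation([0.2, 0.5, 0.3], [1, 0, 1], 10.0))
```

Because the mask removes non-accessing users before normalization, the effective dimension of the allocation follows the number of users that actually selected the vehicle.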

In conclusion, the method provided by the invention has the advantages of simple steps and reasonable design, realizes the optimization of the flight azimuth angle, the power and the bandwidth distribution of the unmanned aerial vehicle through the optimization of the unmanned aerial vehicle and the user network model parameters, effectively adapts to the observation states of a plurality of unmanned aerial vehicles and a plurality of users to predict and output a reasonable cooperative communication optimization strategy, and realizes the maximization of the throughput of a communication system under the action of a multidimensional decision and meets the fairness of resource distribution.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a block diagram of the process flow of the present invention.

Detailed Description

As shown in fig. 1, a method for optimizing cooperative communication between multiple drones and users based on MAPPO algorithm includes the following steps:

Step 102, setting the initial value of the parameter φ of the unmanned aerial vehicle Actor network to φ(0), the initial value of the parameter ω₁ of the unmanned aerial vehicle Critic network to ω₁(0), the initial value of the parameter θ of the user Actor network to θ(0), and the initial value of the parameter ω₂ of the user Critic network to ω₂(0); wherein φ(0), ω₁(0), θ(0) and ω₂(0) all satisfy the orthogonal initialization of the neural network;

step two, setting unmanned aerial vehicles and user scenes:

step 202, setting N users in the ground area D, wherein the set of users is

Wherein, the position coordinate of the nth user at the tth moment is

And is

step 301, setting the observation state of the nth user at the tth moment as

And is

Wherein,

indicating the coordinate position of the nth user at the time instant t,

step 302, the observation state of the nth user at the tth moment

Step 303, using a computer to

Sampling action

Probability of (2)

And is

Wherein,

σ_m,n(t) represents the status of the nth user accessing the mth drone;

Action of mth unmanned aerial vehicle at the next tth moment

Probability distribution of

Wherein,

obeying a beta distribution, i.e.

α_φ and β_φ are both shape parameters of the beta distribution;

the action of the mth unmanned aerial vehicle at the tth moment is shown;

according to

Sampling action

Bandwidth output value of mth unmanned aerial vehicle to nth user at tth moment

And the motion of the mth unmanned aerial vehicle at the tth moment

Probability of (2)

Step 307, setting by computer

And

wherein,

step 308, using a computer to

By computer according to

And is

Wherein,

step 30A, the observation state of the nth user at the tth moment is

Merging the observed states recorded as the ith agent at the tth moment

act of selecting unmanned aerial vehicle by nth user at tth moment

And the action of the mth unmanned aerial vehicle at the tth moment

Merging actions written as ith agent at tth moment

Act of selecting unmanned aerial vehicle by nth user at tth moment

Probability of (2)

And the action of the mth unmanned aerial vehicle at the tth moment

Probability of (2)

Merging action probabilities written as ith agent

Step four, acquiring global states of the unmanned aerial vehicle and the user:

Step 402, using a computer according to

Obtaining the communication speed of the nth user at the t moment

And is

Step 404, setting global state of nth user at tth moment as

Wherein,

And global state of nth user at tth moment

Merging global states recorded as ith agent at tth moment

Wherein i is a positive integer, and

step five, obtaining the rewards of the unmanned aerial vehicle and the user:

step 501, adopt the computer to according to

Step 502, using a computer according to

Step 503, using computer to

Obtaining the reward of the mth unmanned aerial vehicle at the tth moment

step 504, using a computer to

Receive the reward of the nth user at the tth moment

Wherein r_c represents the reward coefficient of the user;

step 505, reward of nth user at tth moment by computer

And reward of mth unmanned aerial vehicle at tth moment

Incorporating rewards accruing as ith agent at time t

Step six, storing experience tuples:

step 601, adopting a computer to send

Step 602, repeating step three to step 601 to obtain the experience tuple of the next moment and storing it in the buffer until t = T_max, at which point the data storage of one round is complete; wherein T_max represents the total number of moments per round;
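The per-round storage described in step six can be sketched as a simple buffer that accepts one experience tuple per moment until T_max entries are held. The field names of `Experience` are assumptions that mirror the merged per-agent quantities (observation state, action, action probability, global state, reward); the exact tuple layout of step 601 is not reproduced here, and `step_fn` is a hypothetical stand-in for steps three to five.

```python
from collections import namedtuple

# Field names are illustrative, not taken from the patent.
Experience = namedtuple(
    "Experience", ["obs", "action", "action_prob", "global_state", "reward"]
)


class RoundBuffer:
    """Stores one round of experience tuples, one entry per moment t."""

    def __init__(self, t_max):
        self.t_max = t_max
        self.data = []

    def is_full(self):
        return len(self.data) >= self.t_max

    def store(self, experience):
        if self.is_full():
            raise RuntimeError("round already holds T_max moments")
        self.data.append(experience)


def collect_round(step_fn, t_max):
    """Call step_fn(t) for t = 1..T_max and store each returned tuple."""
    buffer = RoundBuffer(t_max)
    for t in range(1, t_max + 1):
        buffer.store(step_fn(t))
    return buffer
```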

Step 705, inputting the P-th round of training data and, according to the method in step 701, using the optimized values of the previous round as the initial parameter values, obtaining the P-th round optimized value of the parameter φ of the unmanned aerial vehicle Actor network, the P-th round optimized value of the parameter θ of the user Actor network, the P-th round optimized value of the parameter ω₁ of the unmanned aerial vehicle Critic network, and the P-th round optimized value of the parameter ω₂ of the user Critic network;
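The round-by-round warm start of step 705 can be sketched as a loop in which each round starts from the parameters produced by the previous one. The per-round update of step 701 is not reproduced in this passage, so `optimize_round` is a hypothetical caller-supplied placeholder. `clipped_surrogate` shows the standard PPO clipped objective on which MAPPO is generally built; it is included as an assumption about the kind of objective involved, not as the patent's own formula, and `clip_eps=0.2` is an illustrative default.

```python
def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate; ratios are new/old action probabilities."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


def train_rounds(initial_params, rounds_data, optimize_round):
    """Run the rounds in order, seeding each with the previous round's result."""
    params = initial_params
    for data in rounds_data:
        params = optimize_round(params, data)
    return params
```

In a full implementation the same loop would carry all four parameter sets (φ, θ, ω₁, ω₂) together from one round to the next.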

Step 801, obtaining an optimized network model according to the P-th round optimized value of the parameter φ of the unmanned aerial vehicle Actor network, the P-th round optimized value of the parameter θ of the user Actor network, the P-th round optimized value of the parameter ω₁ of the unmanned aerial vehicle Critic network, and the P-th round optimized value of the parameter ω₂ of the user Critic network;

In this embodiment, in step 401, a computer takes p_m,n(t) and b_m,n(t) from step 309 as inputs and, according to the Shannon channel capacity, obtains the theoretical communication speed c_m,n(t) provided by the mth unmanned aerial vehicle for the nth user at the tth moment; the specific process is as follows:

step 4011, using computer according to formula

step 4012, using computer according to formula

step 4013, adopting computer to calculate according to formula

Wherein ξ_NLoS represents the additional loss under the NLoS link;

step 4014, using computer according to formula

step 4015, using computer according to formula

Step 4016, using computer according to formula

obtaining the theoretical communication speed c_m,n(t) provided by the mth unmanned aerial vehicle for the nth user at the tth moment; wherein n₀ represents the power spectral density of Gaussian white noise in the channel.
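The chain from link geometry to Shannon rate in steps 4011 to 4016 can be sketched as follows. The formulas of those steps are not reproduced in this text, so the sketch assumes the widely used probabilistic air-to-ground model: a sigmoid line-of-sight probability with parameters a and b, free-space path loss plus an additional LoS or NLoS loss, and the Shannon capacity formula. The function names, the additional loss values (1 dB and 20 dB), the choice a = 9.61, b = 0.16 and the 300 m horizontal distance are all illustrative assumptions.

```python
import math

C_LIGHT = 3.0e8  # speed of light in m/s


def los_probability(elev_deg, a, b):
    """Sigmoid line-of-sight probability as a function of elevation angle."""
    return 1.0 / (1.0 + a * math.exp(-b * (elev_deg - a)))


def mean_path_loss_db(horiz_m, height_m, f_c_hz, a, b, xi_los_db, xi_nlos_db):
    """Free-space loss plus the probability-weighted additional loss, in dB."""
    dist = math.hypot(horiz_m, height_m)
    elev_deg = math.degrees(math.atan2(height_m, horiz_m))
    fspl = 20.0 * math.log10(4.0 * math.pi * f_c_hz * dist / C_LIGHT)
    p_los = los_probability(elev_deg, a, b)
    return fspl + p_los * xi_los_db + (1.0 - p_los) * xi_nlos_db


def shannon_rate(p_mw, bw_hz, loss_db, n0_mw_per_hz):
    """Shannon capacity in bit/s for transmit power p and bandwidth b."""
    if bw_hz <= 0.0:
        return 0.0  # no bandwidth allocated, no rate
    gain = 10.0 ** (-loss_db / 10.0)
    snr = p_mw * gain / (n0_mw_per_hz * bw_hz)
    return bw_hz * math.log2(1.0 + snr)


# Embodiment values: h = 500 m, f_c = 2 GHz, n0 = 1e-17 mW/Hz.
loss = mean_path_loss_db(300.0, 500.0, 2.0e9, 9.61, 0.16, 1.0, 20.0)
rate = shannon_rate(p_mw=1.0, bw_hz=1.0e6, loss_db=loss, n0_mw_per_hz=1.0e-17)
```

Power and noise density are both kept in milliwatts so that their ratio is dimensionless without further conversion.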

In this embodiment, in step 4011, a is greater than 4.88 and less than 28, and b is greater than 0 and less than 1;

the value range of the user reward coefficient r_c in step 504 is 1 to 3;

In this embodiment, in step 503, the boundary penalty term of the mth unmanned aerial vehicle at the tth moment

The specific process of obtaining is as follows:

Obtaining the X coordinate of the mth unmanned aerial vehicle at the tth moment

And the Y coordinate of the mth unmanned aerial vehicle at the tth moment

Step 5033, when

is greater than u_max,x or

is less than u_min,x, using a computer according to

when

is greater than u_max,y or

is less than u_min,y, using a computer according to

when

is greater than u_max,x and

is greater than u_max,y, or

is less than u_min,x and

is less than u_min,y, using a computer according to

when

and

are both located in the ground area D,
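The case analysis of the boundary penalty (X out of range, Y out of range, both out of range, or inside the area D) can be sketched as below. The penalty formulas themselves are not reproduced in this text, so the sketch only classifies the case and reports how far the position lies outside the limits. The function names and the example limits are assumptions, and treating any combination of X and Y violations as the "both" case is a simplification of the wording above.

```python
def overshoot(value, lower, upper):
    """Distance by which value lies outside [lower, upper]; 0 if inside."""
    if value > upper:
        return value - upper
    if value < lower:
        return lower - value
    return 0.0


def boundary_case(x, y, x_min, x_max, y_min, y_max):
    """Classify a position into one of four cases and report the overshoot."""
    dx = overshoot(x, x_min, x_max)
    dy = overshoot(y, y_min, y_max)
    if dx > 0.0 and dy > 0.0:
        case = "xy"      # outside in both coordinates
    elif dx > 0.0:
        case = "x"       # outside in X only
    elif dy > 0.0:
        case = "y"       # outside in Y only
    else:
        case = "inside"  # within the area D
    return case, dx, dy


# Hypothetical limits for a 2 km x 2 km area, in metres.
case, dx, dy = boundary_case(2500.0, 100.0, 0.0, 2000.0, 0.0, 2000.0)
```

A penalty term would then be computed from `case`, `dx` and `dy` together with the penalty coefficient r_b and the gradient factor κ_b.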

in this embodiment, the value range of w in step 301 is 3-20;

In this embodiment, the value range of the maximum round T_h set in step 704 is 5000 to 6000;

the total number of rounds

In this embodiment, the area D is a 2 km × 2 km square area, and the deployment height h of the M unmanned aerial vehicles above the ground area D is 500 m. At the start of each round, all unmanned aerial vehicles take off from the origin, and the users are randomly distributed in the area D and move in random directions at random speeds; T_max = 1000.

In this embodiment, the value of the set maximum round T_h is 5000, and the value of B is 2000 to 4000.

In this embodiment, the value of w is 3.

In this embodiment, the total transmit power of each unmanned aerial vehicle is P_total = 10 mW, the total bandwidth resource shared by all unmanned aerial vehicles is B_total = 30 MHz, the signal carrier frequency is f_c = 2 GHz, the power spectral density of Gaussian white noise in the channel is n₀ = 1×10^-17 mW/Hz, and the minimum separable bandwidth is b_min = 0.1 MHz.

In this embodiment, T_max = 1000 and each decision time interval is 1 s, i.e. the interval between the tth moment and the (t+1)th moment is 1 s.

In this embodiment, in actual use, σ_m,n(t) = 1 means that the nth user selects the mth unmanned aerial vehicle as its access base station; otherwise σ_m,n(t) = 0.
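The access indicator σ_m,n(t) and the per-vehicle user count s_m(t) can be illustrated with a small sketch. It assumes that each user selects exactly one unmanned aerial vehicle and that the selection is given as a zero-based index; the function names are illustrative.

```python
def access_matrix(choices, num_uavs):
    """Build sigma[m][n] from choices[n], the UAV index picked by user n."""
    sigma = [[0] * len(choices) for _ in range(num_uavs)]
    for n, m in enumerate(choices):
        sigma[m][n] = 1
    return sigma


def users_per_uav(sigma):
    """s_m: number of users accessing each UAV (row sums of sigma)."""
    return [sum(row) for row in sigma]


# Four users choosing among three UAVs (zero-based indices).
sigma = access_matrix([0, 2, 2, 1], num_uavs=3)
counts = users_per_uav(sigma)  # [1, 1, 2]
```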

In this embodiment, the user reward coefficient r_c is 1, the unmanned aerial vehicle reward coefficient r_d is 2, and the exponential parameter k_r is 5;

in this embodiment, the penalty term coefficient r_b is 20 and the gradient factor κ_b is 8×10^-2.

In this embodiment, α_φ = β_φ = 1.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and all simple modifications, changes and equivalent structural changes made to the above embodiment according to the technical spirit of the present invention still fall within the protection scope of the technical solution of the present invention.

Claims

1. A multi-unmanned aerial vehicle and user cooperative communication optimization method based on a MAPPO algorithm is characterized by comprising the following steps:

step two, setting unmanned aerial vehicles and user scenes:

step 202, setting N users in the ground area D, wherein the set of users is

Wherein, the position coordinate of the nth user at the tth moment is

And is

step 301, setting the observation state of the nth user at the tth moment as

And is

Wherein,

indicating the coordinate position of the nth user at the time instant t,

step 302, the observation state of the nth user at the tth moment

Step 303, using a computer to

Sampling action

Probability of (2)

And is

Wherein,

σ_m,n(t) represents the status of the nth user accessing the mth drone;

Action of mth unmanned aerial vehicle at the next tth moment

Probability distribution of

Wherein,

obeying a beta distribution, i.e.

α_φ and β_φ are both shape parameters of the beta distribution;

the action of the mth unmanned aerial vehicle at the tth moment is shown;

according to

Sampling action

Obtaining the transmit power output value of the mth unmanned aerial vehicle to the nth user at the tth moment

Bandwidth output value of mth unmanned aerial vehicle to nth user at tth moment

And the motion of the mth unmanned aerial vehicle at the tth moment

Probability of (2)

Step 307, setting by computer

And

wherein,

step 308, using a computer to

By computer according to

B_total represents the total bandwidth resource shared by all unmanned aerial vehicles, s_m(t) represents the total number of users accessing the mth unmanned aerial vehicle, and b_min represents the minimum separable bandwidth; P_total represents the total transmit power of each unmanned aerial vehicle;

And is

Wherein,

step 30A, the observation state of the nth user at the tth moment is

Merging, recorded as the observation state of the ith agent at the tth moment

act of selecting unmanned aerial vehicle by nth user at tth moment

And the action of the mth unmanned aerial vehicle at the tth moment

Merging actions written as ith agent at tth moment

Act of selecting unmanned aerial vehicle by nth user at tth moment

Probability of (2)

And the action of the mth unmanned aerial vehicle at the tth moment

Probability of (2)

Merging action probabilities written as ith agent

Step four, acquiring global states of the unmanned aerial vehicle and the user:

Step 401, using a computer to take p_m,n(t) and b_m,n(t) from step 309 as inputs and, according to the Shannon channel capacity, obtain the theoretical communication speed c_m,n(t) provided by the mth unmanned aerial vehicle for the nth user at the tth moment;

Step 402, using a computer according to

Obtaining the communication speed of the nth user at the t moment

And is

Step 404, setting global state of nth user at tth moment as

Wherein,

And global state of nth user at tth moment

Merging global states recorded as ith agent at tth moment

Wherein i is a positive integer, and

step five, obtaining the rewards of the unmanned aerial vehicle and the user:

step 501, adopt the computer to according to

Step 502, using a computer according to

Step 503, using computer to

Obtain reward of mth unmanned aerial vehicle at tth moment

step 504, using a computer to

Receive the reward of the nth user at the tth moment

Wherein r_c represents the reward coefficient of the user;

step 505, reward of nth user at tth moment by computer

And reward of mth unmanned aerial vehicle at tth moment

Incorporating rewards accruing as ith agent at time t

Step six, storing experience tuples:

step 601, adopting a computer to send

Step 801, obtaining an optimized network model according to the P-th round optimized value of the parameter φ of the unmanned aerial vehicle Actor network, the P-th round optimized value of the parameter θ of the user Actor network, the P-th round optimized value of the parameter ω₁ of the unmanned aerial vehicle Critic network, and the P-th round optimized value of the parameter ω₂ of the user Critic network;

2. The method for optimizing cooperative communication between multiple unmanned aerial vehicles and users based on MAPPO algorithm according to claim 1, wherein: in step 401, a computer takes p_m,n(t) and b_m,n(t) from step 309 as inputs and, according to the Shannon channel capacity, obtains the theoretical communication speed c_m,n(t) provided by the mth unmanned aerial vehicle for the nth user at the tth moment; the specific process is as follows:

step 4011, using computer according to formula

step 4012, using computer according to formula

step 4013, adopting computer to calculate according to formula

Wherein ξ_NLoS represents the additional loss under the NLoS link;

step 4014, using computer according to formula

step 4015, using computer according to formula

Step 4016, using computer according to formula

3. The method for optimizing cooperative communication between multiple unmanned aerial vehicles and users based on MAPPO algorithm according to claim 2, wherein: in the step 4011, a is more than 4.88 and less than 28, and b is more than 0 and less than 1;

the value range of the user reward coefficient r_c in step 504 is 1 to 3;

4. The method for optimizing cooperative communication between multiple unmanned aerial vehicles and users based on MAPPO algorithm according to claim 1, wherein: boundary penalty item of mth unmanned aerial vehicle at tth moment in step 503

The specific process of obtaining is as follows:

Obtaining the X coordinate of the mth unmanned aerial vehicle at the tth moment

And the Y coordinate of the mth unmanned aerial vehicle at the tth moment

Step 5033, when

is greater than u_max,x or

is less than u_min,x, using a computer according to

when

is greater than u_max,y or

is less than u_min,y, using a computer according to

when

is greater than u_max,x and

is greater than u_max,y, or

is less than u_min,x and

is less than u_min,y, using a computer according to

when

and

are both located in the ground area D,

5. the method for optimizing cooperative communication between multiple unmanned aerial vehicles and users based on MAPPO algorithm according to claim 1, wherein: in the step 301, the value range of w is 3-20;

6. The method for optimizing cooperative communication between multiple unmanned aerial vehicles and users based on MAPPO algorithm according to claim 1, wherein: the value range of the maximum round T_h set in step 704 is 5000 to 6000;

the total number of rounds