CN109729528B - D2D resource allocation method based on multi-agent deep reinforcement learning - Google Patents
D2D resource allocation method based on multi-agent deep reinforcement learning
- Publication number
- CN109729528B (application CN201910161391.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a D2D resource allocation method based on multi-agent deep reinforcement learning, belonging to the field of wireless communication. First, a heterogeneous network model in which a cellular network and D2D communication share the spectrum is constructed, and the signal-to-interference-plus-noise ratio (SINR) of the D2D receiving users and the SINR of the cellular users are established based on the interference present in the network; the unit-bandwidth communication rates of the cellular links and the D2D links are then calculated, and a D2D resource allocation optimization model in the heterogeneous network is constructed with maximization of the system capacity as the optimization objective. For time slot t, a deep reinforcement learning model of each D2D communication pair is constructed on the basis of the D2D resource allocation optimization model. In subsequent time slots, each D2D communication pair extracts its own state feature vector and inputs it into the trained deep reinforcement learning model to obtain its resource allocation scheme. The invention optimizes spectrum allocation and transmit power, maximizes the system capacity, and provides a low-complexity resource allocation algorithm.
Description
Technical Field
The invention belongs to the field of wireless communication, relates to a heterogeneous cellular network system, and particularly relates to a D2D resource allocation method based on multi-agent deep reinforcement learning.
Background
The popularization of intelligent terminals and the explosive growth of mobile internet services place higher requirements on the data transmission capability of wireless communication networks. Under this trend, the existing cellular network suffers from spectrum resource shortage, heavy base station load and similar problems, and cannot meet the transmission requirements of future wireless networks.
Device-to-Device (D2D) communication allows neighbouring users to establish a direct link for communication. It is a promising technology for future wireless communication networks because it improves spectral efficiency, saves power consumption and offloads base station load. Introducing D2D communication into the cellular network can, on the one hand, save energy and improve the performance of edge users, and on the other hand, greatly improve spectrum utilization by letting D2D communication share the spectrum of cellular users.
However, D2D communication multiplexing the spectrum of the cellular network causes cross-layer interference to the cellular communication links, and the communication quality of cellular users, as the primary users of the cellular band, must be guaranteed. Moreover, when D2D communication is densely deployed, multiple D2D links multiplexing the same spectrum cause peer-to-peer interference among each other, so interference management when the cellular network and D2D communication coexist is an urgent problem to be solved. Wireless network resource allocation aims to mitigate interference through reasonable resource allocation and to improve the utilization efficiency of spectrum resources; it is an effective way to solve the interference management problem.
Existing research on D2D communication resource allocation in cellular networks can be divided into centralized and distributed categories. Centralized methods assume that the base station has instantaneous global channel state information (CSI) and controls the resource allocation of the D2D users. However, acquiring global CSI requires huge signaling overhead, and in future scenarios with massive numbers of wireless devices the base station can hardly possess instantaneous global information, so centralized algorithms are no longer applicable in such device-dense scenarios.
The distributed methods let D2D users autonomously select wireless network resources; existing research is mainly based on game theory and reinforcement learning. Game-theoretic methods model D2D users as players who compete until a Nash equilibrium is reached, but solving for the Nash equilibrium requires a large amount of information exchange among users and many iterations to converge. Reinforcement-learning-based resource allocation research is mainly based on Q learning, such as the deep Q network (DQN): a D2D user is regarded as an agent that autonomously learns a strategy to select wireless network resources. However, when multiple agents learn and train simultaneously, each agent's strategy keeps changing, which makes the training environment non-stationary and the training hard to converge. Therefore, a distributed resource allocation algorithm with good convergence and low complexity needs to be studied to solve the interference management problem of D2D communication in cellular networks.
Disclosure of Invention
In order to solve the problems, the invention provides a D2D resource allocation method based on multi-agent deep reinforcement learning based on a deep reinforcement learning theory, optimizes spectrum allocation and transmission power of a D2D user, realizes system capacity maximization of a cellular network and D2D communication, and ensures communication quality of the cellular user.
The method comprises the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
the heterogeneous network model comprises cellular base stations BS, M cellular downlink users and N D2D communication pairs.
Setting the mth cellular user as CmWherein M is more than or equal to 1 and less than or equal to M; the nth D2D communication pair is DnWherein N is more than or equal to 1 and less than or equal to N. D2D communication pair DnFor transmitting and receiving users inAndand (4) showing.
The cellular downlink communication link and the D2D link adopt the orthogonal frequency division multiplexing technology, each cellular user occupies one communication resource block RB, and no interference exists between any two cellular links; while allowing one cellular user to share the same RB with multiple D2D users, the communication resource blocks RB and transmission power are selected autonomously by the D2D user.
Step two, establishing a signal-to-interference-and-noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user based on interference existing in a heterogeneous network model;
interference includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
The received-signal SINR of cellular user $C_m$ on the $k$-th communication resource block RB from the base station is:

$$\xi_{m}^{k} = \frac{P_B\, g_{B,m}^{k}}{\sum_{n \in \mathcal{D}_k} P_{n}^{d}\, g_{n,m}^{k} + N_0}$$

where $P_B$ represents the fixed transmit power of the base station; $g_{B,m}^{k}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{n}^{d}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $g_{n,m}^{k}$ is the channel gain of the interfering link from the transmitting user of $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ represents the power spectral density of the additive white Gaussian noise.

The received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\xi_{n}^{k} = \frac{P_{n}^{d}\, g_{n}^{k}}{P_B\, g_{B,n}^{k} + \sum_{i \in \mathcal{D}_k,\, i \neq n} P_{i}^{d}\, g_{i,n}^{k} + N_0}$$

where $g_{n}^{k}$ is the D2D channel gain of the target link from the transmitting user to the receiving user of $D_n$; $g_{B,n}^{k}$ is the channel gain of the interfering link from the base station to the receiving user of $D_n$ when multiple links share the RB; $P_{i}^{d}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $g_{i,n}^{k}$ is the channel gain of the interfering link from the transmitting user of $D_i$ to the receiving user of $D_n$ when multiple links share the RB.
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
the optimization model is as follows:
BN×K=[bn,k]an allocation matrix of communication resource blocks RB for D2D communication pairs, bn,kFor D2D communication pair DnThe RB selection parameter of (a) is,a power control vector that is composed jointly for the transmit powers of all D2D communication pairs.
Constraint C1 indicates that the SINR of each cellular user is greater than the minimum threshold for the cellular user's received SINREnsuring the communication quality of cellular users; the constraint condition C2 represents a D2D link spectrum allocation constraint condition, and each D2D user pair can be allocated with only one communication resource block RB at most; constraint C3 characterizes that the transmission power of the transmitting user of each D2D communication pair cannot exceed the maximum transmission power threshold Pmax。
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
the specific construction steps are as follows:
step 501, for a certain D2D communication pair DpConstructing a state feature vector s at time slot tt;
Instantaneous channel state information for the D2D communication link;for base station to the D2D communication pair DpReceiving instantaneous channel state information of an interference link of a user; i ist-1The D2D communication pair D for the last time slot t-1pReceiving an interference power value received by a user;the D2D communication pair D for the last time slot t-1pThe neighboring D2D communication pair occupied RB;the D2D communication pair D for the last time slot t-1pThe RBs occupied by the neighboring cellular users.
Step 502, simultaneously constructing the reward function $r_t$ of D2D communication pair $D_p$ at time slot $t$:

$$r_t = \begin{cases} e_{p}^{k}, & \xi_{m}^{k} \ge \xi_{\min}^{C} \\ r_n, & \text{otherwise} \end{cases}$$

where $e_{p}^{k}$ is the unit-bandwidth communication rate of the D2D link and $r_n$ is a negative reward, $r_n < 0$.
Step 503, constructing the state features of the multi-agent Markov game model from the state feature vectors of the D2D communication pairs; to solve the Markov game, the reward functions of the D2D communication pairs are used as the reward functions in the multi-agent actor-critic deep reinforcement learning model.

The Markov game model of the agents is:

$$\left( \mathcal{S},\ \mathcal{A}_1,\ldots,\mathcal{A}_N,\ r_1,\ldots,r_N,\ p,\ \gamma \right)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}_j$ is the action space, $r_j$ is the reward corresponding to the reward function of the $j$-th D2D communication pair, $j \in \{1,\ldots,N\}$, $p$ is the state transition probability of the whole environment, and $\gamma$ is the discount coefficient.
The learning goal of each D2D communication pair is to maximize its total discounted return:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_{t}^{j}$$

where $T$ is the time horizon, $\gamma^{t}$ is the discount coefficient raised to the power $t$, and $r_{t}^{j}$ is the reward of the $j$-th D2D communication pair at time slot $t$.
The actor-critic reinforcement learning model consists of an actor and a critic.
in the training process, the strategy of the actor is fitted by using a deep neural network, and is updated by using the following deterministic strategy gradient formula so as to obtain the maximum expected return.
Let mu be { mu ═ mu1,...,μNDenotes the deterministic policy for all agents, θ ═ θ1,...,θNThe parameters contained in the strategy are expressed, and the gradient formula of the expected return of the jth agent is as follows:
s contains state information for all agents, s ═ s1,...,sN}; a contains the action information of all agents, a ═ a1,...,aN};Is an experience replay buffer;
The critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_{j}^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1}) \sim \mathcal{D}}\!\left[ \left( Q_{j}^{\mu}(s_t,a_t) - y_t \right)^{2} \right],\qquad y_t = r_{t}^{j} + \gamma\, Q_{j}^{\mu}\big(s_{t+1}, a_{t+1}\big)\Big|_{a_{t+1}=\mu(s_{t+1})}$$

where each sample in $\mathcal{D}$ is a tuple $(s_t,a_t,r_t,s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_{t}^{1},\ldots,r_{t}^{N}\}$ contains the rewards of all agents at time slot $t$.
Step 504, performing offline training on the deep reinforcement learning model by using historical communication data to obtain D2D communication D for solving the DpProblem of resource allocationThe model of (1).
And step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
The resource allocation scheme includes selecting appropriate communication resource blocks RB and transmission power.
The invention has the advantages that:
(1) a D2D resource allocation method based on multi-agent deep reinforcement learning optimizes the spectrum allocation and transmission power of D2D users, and maximizes the system capacity while ensuring the communication quality of cellular users;
(2) a D2D resource allocation method based on multi-agent deep reinforcement learning designs a D2D distributed resource allocation algorithm in a heterogeneous cellular network, thereby greatly reducing the signaling overhead generated for obtaining global instant channel state information;
(3) A D2D resource allocation method based on multi-agent deep reinforcement learning innovatively introduces a multi-agent reinforcement learning model with centralized training and distributed execution, solves the resource allocation problem of multiple D2D communication pairs, obtains good training convergence performance, and provides a low-complexity resource allocation algorithm.
Drawings
Fig. 1 is a schematic diagram of a heterogeneous network model of a cellular network and D2D communication sharing spectrum, which is constructed by the present invention;
FIG. 2 is a flow chart of a D2D resource allocation method based on multi-agent deep reinforcement learning according to the present invention;
FIG. 3 is a diagram illustrating a deep reinforcement learning model for D2D communication resource allocation according to the present invention;
FIG. 4 is a diagram of the single-agent actor-critic reinforcement learning model of the present invention;
FIG. 5 is a diagram of a multi-agent actor critic reinforcement learning model of the present invention;
fig. 6 is a graph comparing the outage rates of cellular users according to the present invention with the DQN-based D2D resource allocation method and the D2D random resource allocation method.
Fig. 7 is a graph comparing the total system capacity performance of the present invention with the DQN-based D2D resource allocation method and the D2D random resource allocation method.
FIG. 8 is a graphical illustration of the total reward function and system capacity convergence performance of the present invention;
fig. 9 is a graph of the total return function and the system capacity convergence performance of the DQN-based D2D resource allocation method of the present invention.
Detailed Description
In order that the technical principles of the present invention may be more clearly understood, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
A D2D resource allocation method based on multi-agent deep reinforcement learning (MADRL, Multi-Agent Deep Reinforcement Learning based Device-to-Device Resource Allocation Method) is applied to a heterogeneous network in which a cellular network and D2D communication coexist. First, the signal-to-interference-plus-noise ratio (SINR) and unit-bandwidth communication rate expressions of the D2D receiving users and the cellular users are established; then, taking maximization of the system capacity as the optimization objective, and taking the cellular-user SINR being greater than the minimum SINR threshold, the D2D link spectrum allocation constraint, and the transmit power of the D2D transmitting users being smaller than the maximum transmit power threshold as the optimization constraints, a D2D resource allocation optimization model in the heterogeneous network is constructed;
constructing a state feature vector and a return function of a multi-agent deep reinforcement learning model for D2D resource allocation according to an optimization model; establishing a multi-agent actor critic deep reinforcement learning model for D2D resource allocation based on a partially observable Markov game model and an actor critic reinforcement learning theory;
performing offline training by using historical communication data obtained by the simulation platform;
According to the instantaneous channel state information of the D2D link, the instantaneous channel state information of the interfering link from the base station to the D2D receiving user, the interference power received by the D2D receiving user in the previous time slot, the communication resource block (RB) occupied by the neighbouring D2D links in the previous time slot, and the RB occupied by the neighbouring cellular users in the previous time slot, the resource allocation strategy obtained by training is used to select an appropriate RB and transmit power.
As shown in fig. 2, the whole scheme comprises establishing the system model, formulating the optimization problem, establishing the optimization model, establishing the multi-agent reinforcement learning model, training the model, and executing the algorithm; establishing the multi-agent reinforcement learning model comprises constructing the state features, designing the reward function, and establishing the multi-agent actor-critic reinforcement learning model.
the method comprises the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
as shown in fig. 1, the heterogeneous network model includes a cellular Base Station (BS), M cellular downlink users, and N D2D communication pairs.
Let the m-th cellular user be $C_m$, where $1 \le m \le M$, and the n-th D2D communication pair be $D_n$, where $1 \le n \le N$. The transmitting user and the receiving user of D2D communication pair $D_n$ are denoted by $D_n^{t}$ and $D_n^{r}$, respectively.
The cellular downlink communication link and the D2D link both adopt an Orthogonal Frequency Division Multiplexing (OFDM) technology, each cellular user occupies one communication resource block RB, and there is no interference between any two cellular links; in the system model, one cellular user is allowed to share the same RB simultaneously with multiple D2D users, with communication resource blocks RB and transmission power being autonomously selected by D2D users.
Step two, based on the interference existing in the heterogeneous network model, establishing the signal-to-interference-plus-noise ratio (SINR) of the D2D receiving users and the SINR of the cellular users;
interference includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
The received-signal SINR of cellular user $C_m$ on the $k$-th communication resource block RB from the base station is:

$$\xi_{m}^{k} = \frac{P_B\, g_{B,m}^{k}}{\sum_{n \in \mathcal{D}_k} P_{n}^{d}\, g_{n,m}^{k} + N_0}$$

where $P_B$ represents the fixed transmit power of the base station; $g_{B,m}^{k}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{n}^{d}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $g_{n,m}^{k}$ is the channel gain of the interfering link from the transmitting user of $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ represents the power spectral density of the Additive White Gaussian Noise (AWGN).

The received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\xi_{n}^{k} = \frac{P_{n}^{d}\, g_{n}^{k}}{P_B\, g_{B,n}^{k} + \sum_{i \in \mathcal{D}_k,\, i \neq n} P_{i}^{d}\, g_{i,n}^{k} + N_0}$$

where $g_{n}^{k}$ is the D2D channel gain of the target link from the transmitting user to the receiving user of $D_n$; $g_{B,n}^{k}$ is the channel gain of the interfering link from the base station to the receiving user of $D_n$ when multiple links share the RB; $P_{i}^{d}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $g_{i,n}^{k}$ is the channel gain of the interfering link from the transmitting user of $D_i$ to the receiving user of $D_n$ when multiple links share the RB.
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
Based on the Shannon formula, the unit-bandwidth communication rate of the cellular link is calculated as:

$$e_{m}^{k} = \log_{2}\!\left(1 + \xi_{m}^{k}\right)$$

and the unit-bandwidth communication rate of D2D link $D_n$ on the $k$-th RB is calculated analogously as $e_{n}^{k} = \log_{2}\!\left(1 + \xi_{n}^{k}\right)$.
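For illustration of steps two and three, a minimal sketch of the per-RB SINR and unit-bandwidth rate calculation is given below. The function and array names, shapes and calling convention are assumptions made purely for this illustration; they are not notation fixed by the method.

```python
import numpy as np

def rb_sinr_and_rates(P_B, g_Bm, P_d, g_dm, G_dd, g_Bd, N0):
    """Per-RB SINRs and unit-bandwidth rates for one cellular user and the
    D2D pairs sharing its RB.
    P_B  : base-station transmit power (scalar)
    g_Bm : BS -> cellular-user target-link gain (scalar)
    P_d  : transmit powers of the n D2D pairs sharing the RB, shape (n,)
    g_dm : D2D transmitter -> cellular-user interfering-link gains, shape (n,)
    G_dd : G_dd[i, j] = gain from transmitter of pair i to receiver of pair j
    g_Bd : BS -> D2D-receiver interfering-link gains, shape (n,)
    N0   : noise power spectral density
    """
    # SINR of the cellular user: desired BS signal over D2D interference + noise
    sinr_c = P_B * g_Bm / (np.dot(P_d, g_dm) + N0)

    n = len(P_d)
    sinr_d = np.empty(n)
    for j in range(n):
        # interference at receiver j: BS downlink plus the other D2D transmitters
        interference = P_B * g_Bd[j] + sum(
            P_d[i] * G_dd[i, j] for i in range(n) if i != j)
        sinr_d[j] = P_d[j] * G_dd[j, j] / (interference + N0)

    # Shannon unit-bandwidth rates (bit/s/Hz) of the cellular and D2D links
    return np.log2(1.0 + sinr_c), np.log2(1.0 + sinr_d)
```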
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
Since the allocation matrix $\mathbf{B}_{N\times K}=[b_{n,k}]$ of communication resource blocks RB of the D2D communication pairs and the power control vector $\mathbf{P}=\{P_{1}^{d},\ldots,P_{N}^{d}\}$ jointly formed by the transmit powers of all D2D communication pairs must be optimized to maximize the system capacity while guaranteeing the communication quality of the cellular users, the optimization model is built as follows:

$$\begin{aligned}
\max_{\mathbf{B}_{N\times K},\,\mathbf{P}}\ & \sum_{m=1}^{M} e_{m}^{k} + \sum_{n=1}^{N} e_{n}^{k} \\
\text{s.t. } & C1:\ \xi_{m}^{k} \ge \xi_{\min}^{C},\quad \forall m \\
& C2:\ \sum_{k=1}^{K} b_{n,k} \le 1,\ \ b_{n,k}\in\{0,1\},\quad \forall n \\
& C3:\ 0 \le P_{n}^{d} \le P_{\max},\quad \forall n
\end{aligned}$$

where $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$.

Constraint C1 characterizes the cellular-user SINR constraint, meaning that the SINR of each cellular user is greater than the minimum threshold $\xi_{\min}^{C}$ of the cellular user's received SINR, ensuring the communication quality of the cellular users; constraint C2 characterizes the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 characterizes that the transmit power of the transmitting user of each D2D communication pair cannot exceed the maximum transmit power threshold $P_{\max}$.
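The following sketch evaluates a candidate allocation against constraints C1-C3 and, when feasible, returns the resulting system capacity. It is a minimal illustration under assumed inputs (per-link SINRs and unit-bandwidth rates computed as in the previous sketch); the function name and argument layout are not part of the method itself.

```python
import numpy as np

def check_and_capacity(B, P, sinr_c, rates_c, rates_d, sinr_min_c, P_max):
    """Evaluate a candidate allocation (B, P) against C1-C3 and return the
    system capacity (sum of unit-bandwidth rates) when it is feasible.
    B       : (N, K) binary RB-selection matrix of the D2D pairs
    P       : (N,) transmit powers of the D2D transmitting users
    sinr_c  : (M,) received SINRs of the cellular users under (B, P)
    rates_c : (M,) cellular unit-bandwidth rates, rates_d: (N,) D2D rates
    """
    c1 = np.all(sinr_c >= sinr_min_c)        # C1: cellular SINR threshold
    c2 = np.all(B.sum(axis=1) <= 1)          # C2: at most one RB per D2D pair
    c3 = np.all((P >= 0) & (P <= P_max))     # C3: power budget
    if not (c1 and c2 and c3):
        return False, 0.0
    return True, float(rates_c.sum() + rates_d.sum())
```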
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
A reinforcement learning model for D2D resource allocation is established as shown in fig. 3. Its principle is as follows: in time slot $t$, each D2D communication pair acts as an agent; it observes a state $s_t$ from the state space $\mathcal{S}$, and then, according to the policy $\pi$ and the current state, selects an action $a_t$ from the action space $\mathcal{A}$, i.e. the D2D communication pair selects the RB to use and the transmit power. After performing action $a_t$, the D2D communication pair observes the environment transition to a new state $s_{t+1}$ and obtains a reward $r_t$; based on the obtained reward $r_t$, the D2D communication pair adjusts its policy to achieve a higher return. The specific construction steps are as follows:
Step 501, for a given D2D communication pair $D_p$, constructing the state feature vector at time slot $t$:

$$s_t = \left\{ g_{t}^{p},\ g_{t}^{B,p},\ I_{t-1},\ B_{t-1}^{D},\ B_{t-1}^{C} \right\}$$

where $g_{t}^{p}$ is the instantaneous channel state information of the D2D communication link; $g_{t}^{B,p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of $D_p$ in the previous time slot $t-1$; $B_{t-1}^{D}$ is the RBs occupied by the D2D communication pairs neighbouring $D_p$ in time slot $t-1$; and $B_{t-1}^{C}$ is the RBs occupied by the cellular users neighbouring $D_p$ in time slot $t-1$.
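A minimal sketch of how the state feature vector of step 501 might be assembled is shown below; the flattening into a single numeric vector and the indicator encoding of the occupied RBs are assumptions made for illustration.

```python
import numpy as np

def build_state(g_d2d, g_bs_to_rx, interference_prev, rb_neighbor_d2d, rb_neighbor_cell):
    """State feature vector s_t of one D2D pair at slot t.
    g_d2d            : instantaneous CSI of the pair's own D2D link
    g_bs_to_rx       : instantaneous CSI of the BS -> D2D-receiver interfering link
    interference_prev: interference power measured by the receiver in slot t-1
    rb_neighbor_d2d  : indicator vector of RBs used by neighbouring D2D pairs in t-1
    rb_neighbor_cell : indicator vector of RBs used by neighbouring cellular users in t-1
    """
    return np.concatenate([
        np.atleast_1d(g_d2d).astype(float),
        np.atleast_1d(g_bs_to_rx).astype(float),
        np.atleast_1d(interference_prev).astype(float),
        np.asarray(rb_neighbor_d2d, dtype=float),
        np.asarray(rb_neighbor_cell, dtype=float),
    ])
```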
Step 502, simultaneously, according to the optimization objective, constructing the reward function $r_t$ of D2D communication pair $D_p$ at time slot $t$.

The reward function is designed to take into account both the minimum received-SINR threshold of the cellular user and the unit-bandwidth rate of the D2D communication pair. If the received SINR of the cellular user whose spectrum is shared by the D2D communication still satisfies the cellular-user SINR constraint, a positive reward is obtained; otherwise a negative reward $r_n$, with $r_n < 0$, is obtained. To boost the capacity of the D2D communication link, the positive reward is set to the unit-bandwidth communication rate of the D2D link. Thus, the reward function is:

$$r_t = \begin{cases} e_{p}^{k}, & \xi_{m}^{k} \ge \xi_{\min}^{C} \\ r_n, & \text{otherwise} \end{cases}$$
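A direct sketch of this reward rule follows; the default value used for the negative reward is an assumed placeholder.

```python
def d2d_reward(rate_d2d, sinr_cellular, sinr_min_c, r_negative=-1.0):
    """Reward of a D2D pair at slot t: its unit-bandwidth rate when the cellular
    user sharing the RB still meets the SINR threshold, otherwise a fixed
    negative reward (the value -1.0 is an assumed placeholder)."""
    return rate_d2d if sinr_cellular >= sinr_min_c else r_negative
```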
Step 503, constructing the state features of the multi-agent Markov game model from the state feature vectors of the D2D communication pairs; to solve the Markov game, the reward functions of the D2D communication pairs are used as the reward functions in the multi-agent actor-critic deep reinforcement learning model.
Each agent uses an actor-critic reinforcement learning model composed of an actor and a critic, whose policies are fitted with deep neural networks, as shown in fig. 4. The D2D actor network takes the environment state $s_t$ as input and outputs the action $a_t$, i.e. the selected RB and transmit power; the critic network takes the environment state vector $s_t$ and the selected action $a_t$ as input and outputs a temporal-difference error (TD error) calculated from the Q value, and this TD error drives the learning of both networks.
In the heterogeneous cellular network, the resource allocation of a plurality of D2D communication pairs is a multi-agent reinforced learning problem and can be modeled as a partially observable Markov game model, and the Markov game models of N agents are as follows:
wherein,is a space of states that is,is an action space, rjThe value of the return value of the jth intelligent agent is the return value corresponding to the return function of the jth D2D communication pair, j ∈ { 1.·, N }, p is the state transition probability of the whole environment, and gamma is a discount coefficient.
The goal of each agent's learning is to maximize its total discount return;
t is the time range; gamma raytIs the discount coefficient to the power of t;is the return value of the return function of the jth D2D communication pair at time slot t.
Aiming at the Markov game model, the reinforcement learning model of the actor critics is expanded to a multi-agent scene, and a deep reinforcement learning model of the multi-agent is constructed, as shown in FIG. 5. During training, the critic part uses historical global information to guide the actor part to update the strategy; when the system is executed, the single agent only uses part of the environmental information obtained by observation and uses the actor strategy obtained by training to make action selection, thereby realizing centralized training and distributed execution.
In the centralized training process, the strategy of N agents uses pi ═ pi1,...,πNDenotes, θ ═ θ1,...,θNDenotes the parameters contained in the policy, where the jth agent expects a rewardThe gradient of (d) is:
here, s includes status information of all agents, and s ═ s1,...,sN}; a contains the action information of all agents, a ═ a1,...,aN};The method is a centralized action-value function, takes the state information and actions of all agents as input, and outputs the Q value of the jth agent.
Extending the above description to deterministic policies, deterministic policies are considered(abbreviated as mu)j) Let μ ═ μ1,...,μNThe deterministic policies of all agents are represented, the gradient of the j-th agent's expected reward is:
here, theIs an empirical playback buffer whichIn tuples(s) of each samplet,at,rt,st+1) Record the historical data of all agents, hereIncluding the reward of all agents at time slot t. The strategy of the actor part is fitted by using a deep neural network, the gradient formula is an updating method of the actor network, and a gradient ascending method is used for updating so as to obtain the maximum expected return.
The critic network also uses a deep neural network for fitting by minimizing a centralized action-cost functionTo update the loss function of:
Step 504, performing offline training on the deep reinforcement learning model with historical communication data to obtain a model for solving the resource allocation problem of D2D communication pair $D_p$.
The training steps are as follows:
(1) initializing a cell, a base station, a cellular link, and a D2D link using a communication simulation platform;
(2) initializing strategy models pi and parameters theta of all agents, and initializing communication simulation time slot number T;
(3) initializing a communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$, select an action $a_t$ based on $s_t$ and $\pi$, and obtain a reward $r_t$; $t \leftarrow t+1$;
(7) training with mini-batch data and updating the parameters $\theta$ of the policy $\pi$;
(8) returning to step (4), and ending the training when $t = T$;
(9) returning the parameters $\theta$;
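Below is a minimal skeleton of the offline training loop corresponding to steps (1)-(9). The `env`, `agents` and `buffer` objects and their method names are hypothetical placeholders (the communication simulation platform itself is not specified here), introduced only to show the control flow.

```python
def train_offline(env, agents, buffer, T, batch_size=64, gamma=0.95):
    """Offline centralized training loop following steps (1)-(9).
    env    : hypothetical communication simulation platform (cell, BS, links)
    agents : list of N D2D agents, each exposing act() and update()
    buffer : shared experience replay buffer of joint transitions"""
    s = env.reset()                                        # steps (1)-(3)
    for t in range(T):
        a = [ag.act(s[j]) for j, ag in enumerate(agents)]  # step (4): observe, act
        s_next, r, _ = env.step(a)                         # environment transition
        buffer.add(s, a, r, s_next)                        # store joint transition
        if len(buffer) >= batch_size:
            batch = buffer.sample(batch_size)              # step (7): mini-batch update
            for j, ag in enumerate(agents):
                ag.update(j, batch, gamma)                 # e.g. update_agent_j above
        s = s_next                                         # step (8): loop until t = T
    return [ag.parameters() for ag in agents]              # step (9): return parameters
```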
and step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
The resource allocation scheme includes selecting appropriate communication resource blocks RB and transmission power.
The execution steps are as follows:
(1) initializing a cell, a base station, a cellular link, a D2D link using a communication simulation platform;
(2) initializing strategy models pi of all agents, importing the trained parameters theta into the models pi, and initializing communication simulation time slot number T;
(3) initializing a communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$ and select an action $a_t$ based on $s_t$ and $\pi$, i.e. an RB and a transmit power; the received SINR of the D2D users and the system capacity are recorded;
(5) $t \leftarrow t+1$; the simulation platform updates the environment, and all D2D communication pairs obtain the observation $s_{t+1}$ of the environment;
(6) returning to step (4) until $t = T$.
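A corresponding sketch of the distributed execution phase follows. As above, the `env` and `agents` objects are hypothetical placeholders; the point illustrated is only that each agent acts on its own local observation with the trained policy.

```python
def execute_distributed(env, agents, T):
    """Distributed execution: each trained agent selects its RB and transmit power
    from its own local observation only; D2D received SINR and system capacity
    statistics are read back from the simulation platform."""
    s = env.reset()
    capacity_log = []
    for t in range(T):
        a = [ag.act(s[j], explore=False) for j, ag in enumerate(agents)]  # local policies
        s, _, info = env.step(a)                  # platform updates the environment
        capacity_log.append(info["system_capacity"])
    return capacity_log
```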
The multi-agent D2D resource allocation method is compared with the DQN-based D2D resource allocation method and with the random D2D resource allocation method.
As shown in fig. 6, MADRL denotes the method of the present invention, DQN denotes the D2D resource allocation method based on the deep Q network, and Random denotes the D2D resource allocation method based on random allocation; the figure shows the influence of the three methods on the communication quality of the cellular users. It can be seen that the MADRL algorithm of the present invention achieves the lowest cellular-user outage probability for different numbers of D2D users.
As shown in fig. 7, which shows the influence of the three methods on the total system capacity, the MADRL algorithm of the present invention achieves the largest system capacity as the number of D2D communication pairs increases.
FIG. 8 illustrates the total reward function and system capacity convergence performance of the present invention, and fig. 9 shows those of the DQN-based D2D resource allocation method. Comparing the two, the present method introduces global information into the training process for centralized training, so the training environment is more stable and the convergence performance is better. It can therefore be concluded that MADRL achieves higher system throughput than Random and DQN while preserving the communication quality of the cellular users, and has better convergence performance than DQN.
In conclusion, by implementing the multi-agent reinforcement learning-based D2D resource allocation method, the communication quality of cellular users can be protected, and the system throughput can be maximized; compared with a centralized algorithm, the distributed resource allocation algorithm designed by the invention reduces signaling overhead; compared with other resource allocation algorithms based on Q learning, the algorithm designed by the invention has better convergence performance.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
Claims (3)
1. A D2D resource allocation method based on multi-agent deep reinforcement learning is characterized by comprising the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
the heterogeneous network model comprises a cellular Base Station (BS), M cellular downlink users and N D2D communication pairs;
setting the m-th cellular user as $C_m$, wherein $1 \le m \le M$, and the n-th D2D communication pair as $D_n$, wherein $1 \le n \le N$; the transmitting user and the receiving user of D2D communication pair $D_n$ are denoted by $D_n^{t}$ and $D_n^{r}$, respectively;
the cellular downlink communication link and the D2D link adopt the orthogonal frequency division multiplexing technology, each cellular user occupies one communication resource block RB, and no interference exists between any two cellular links; simultaneously, one cellular user is allowed to share the same RB with a plurality of D2D users, and communication Resource Blocks (RB) and transmission power are autonomously selected by the D2D users;
step two, establishing a signal-to-interference-and-noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user based on interference existing in a heterogeneous network model;
the received-signal SINR of cellular user $C_m$ on the $k$-th communication resource block RB from the base station is:

$$\xi_{m}^{k} = \frac{P_B\, g_{B,m}^{k}}{\sum_{n \in \mathcal{D}_k} P_{n}^{d}\, g_{n,m}^{k} + N_0}$$

wherein $P_B$ represents the fixed transmit power of the base station; $g_{B,m}^{k}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{n}^{d}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $g_{n,m}^{k}$ is the channel gain of the interfering link from the transmitting user of $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ represents the power spectral density of the additive white Gaussian noise;

the received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\xi_{n}^{k} = \frac{P_{n}^{d}\, g_{n}^{k}}{P_B\, g_{B,n}^{k} + \sum_{i \in \mathcal{D}_k,\, i \neq n} P_{i}^{d}\, g_{i,n}^{k} + N_0}$$

wherein $g_{n}^{k}$ is the D2D channel gain of the target link from the transmitting user to the receiving user of $D_n$; $g_{B,n}^{k}$ is the channel gain of the interfering link from the base station to the receiving user of $D_n$ when multiple links share the RB; $P_{i}^{d}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $g_{i,n}^{k}$ is the channel gain of the interfering link from the transmitting user of $D_i$ to the receiving user of $D_n$ when multiple links share the RB;
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
the optimization model is as follows:

$$\begin{aligned}
\max_{\mathbf{B}_{N\times K},\,\mathbf{P}}\ & \sum_{m=1}^{M} e_{m}^{k} + \sum_{n=1}^{N} e_{n}^{k} \\
\text{s.t. } & C1:\ \xi_{m}^{k} \ge \xi_{\min}^{C},\quad \forall m \\
& C2:\ \sum_{k=1}^{K} b_{n,k} \le 1,\ \ b_{n,k}\in\{0,1\},\quad \forall n \\
& C3:\ 0 \le P_{n}^{d} \le P_{\max},\quad \forall n
\end{aligned}$$

wherein $\mathbf{B}_{N\times K}=[b_{n,k}]$ is the allocation matrix of communication resource blocks RB of the D2D communication pairs, $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$, and $\mathbf{P}=\{P_{1}^{d},\ldots,P_{N}^{d}\}$ is the power control vector jointly composed of the transmit powers of all D2D communication pairs;

constraint C1 indicates that the SINR of each cellular user is greater than the minimum threshold $\xi_{\min}^{C}$ of the cellular user's received SINR, ensuring the communication quality of the cellular users; constraint C2 represents the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 characterizes that the transmit power of the transmitting user of each D2D communication pair cannot exceed the maximum transmit power threshold $P_{\max}$;
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
the specific construction steps are as follows:
step 501, for a given D2D communication pair $D_p$, constructing the state feature vector at time slot $t$:

$$s_t = \left\{ g_{t}^{p},\ g_{t}^{B,p},\ I_{t-1},\ B_{t-1}^{D},\ B_{t-1}^{C} \right\}$$

wherein $g_{t}^{p}$ is the instantaneous channel state information of the D2D communication link; $g_{t}^{B,p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of $D_p$ in the previous time slot $t-1$; $B_{t-1}^{D}$ is the RBs occupied by the D2D communication pairs neighbouring $D_p$ in time slot $t-1$; and $B_{t-1}^{C}$ is the RBs occupied by the cellular users neighbouring $D_p$ in time slot $t-1$;
step 502, simultaneously constructing the reward function $r_t$ of D2D communication pair $D_p$ at time slot $t$:

$$r_t = \begin{cases} e_{p}^{k}, & \xi_{m}^{k} \ge \xi_{\min}^{C} \\ r_n, & \text{otherwise} \end{cases}$$

wherein $e_{p}^{k}$ is the unit-bandwidth communication rate of the D2D link and $r_n$ is a negative reward, $r_n < 0$;
Step 503, constructing the state characteristics of the multi-agent Markov game model by using the state characteristic vectors of the D2D communication pair; in order to optimize the Markov game model, a return function in the deep reinforcement learning model of the multi-agent actor critic is established by utilizing the return function of the D2D communication pair;
each agent markov game model is:
wherein,is a space of states that is,is an action space, rjThe method comprises the steps that a return value corresponding to a return function of a jth D2D communication pair is j ∈ { 1.., N }, p is the state transition probability of the whole environment, and gamma is a discount coefficient;
the learning goal of each D2D communication pair is to maximize its total discounted return:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_{t}^{j}$$

wherein $T$ is the time horizon, $\gamma^{t}$ is the discount coefficient raised to the power $t$, and $r_{t}^{j}$ is the reward of the $j$-th D2D communication pair at time slot $t$;
the deep reinforcement learning model of the actor critics consists of actors and critics;
in the training process, the strategy of the actor is fitted by using a deep neural network, and is updated by using the following deterministic strategy gradient formula so as to obtain the maximum expected return;
let $\mu = \{\mu_1,\ldots,\mu_N\}$ denote the deterministic policies of all agents and $\theta = \{\theta_1,\ldots,\theta_N\}$ the parameters contained in the policies; the gradient of the expected return of the $j$-th agent is:

$$\nabla_{\theta_j} J(\mu_j) = \mathbb{E}_{s,a \sim \mathcal{D}}\!\left[ \nabla_{\theta_j}\mu_j(a_j \mid s_j)\, \nabla_{a_j} Q_{j}^{\mu}(s,a)\Big|_{a_j=\mu_j(s_j)} \right]$$

wherein $s=\{s_1,\ldots,s_N\}$ contains the state information of all agents, $a=\{a_1,\ldots,a_N\}$ contains the action information of all agents, $Q_{j}^{\mu}(s,a)$ is the centralized action-value function of the $j$-th agent, and $\mathcal{D}$ is the experience replay buffer;

the critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_{j}^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1}) \sim \mathcal{D}}\!\left[ \left( Q_{j}^{\mu}(s_t,a_t) - y_t \right)^{2} \right],\qquad y_t = r_{t}^{j} + \gamma\, Q_{j}^{\mu}\big(s_{t+1}, a_{t+1}\big)\Big|_{a_{t+1}=\mu(s_{t+1})}$$

wherein each sample in $\mathcal{D}$ is a tuple $(s_t,a_t,r_t,s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_{t}^{1},\ldots,r_{t}^{N}\}$ contains the rewards of all agents at time slot $t$;
step 504, performing offline training on the deep reinforcement learning model with historical communication data to obtain a model for solving the resource allocation problem of D2D communication pair $D_p$;
and step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
2. The multi-agent deep reinforcement learning-based D2D resource allocation method as claimed in claim 1, wherein the interference in step two includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
3. The multi-agent deep reinforcement learning-based D2D resource allocation method as claimed in claim 1, wherein the resource allocation scheme in step six includes selecting appropriate communication resource blocks RB and transmission power;
the execution steps are as follows:
(1) initializing a cell, a base station, a cellular link, a D2D link using a communication simulation platform;
(2) initializing strategy models pi of all agents, importing the trained parameters theta into the models pi, and initializing communication simulation time slot number T;
(3) initializing a communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$ and select an action $a_t$ based on $s_t$ and $\pi$, i.e. an RB and a transmit power; the received SINR of the D2D users and the system capacity are recorded;
(5) $t \leftarrow t+1$; the simulation platform updates the environment, and all D2D communication pairs obtain the observation $s_{t+1}$ of the environment;
and returning to step (4) until $t = T$.
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2018115721684 | 2018-12-21 | | |
| CN201811572168 | 2018-12-21 | | |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN109729528A | 2019-05-07 |
| CN109729528B | 2020-08-18 |
Family
ID=66300856

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910161391.8A (CN109729528B, Active) | D2D resource allocation method based on multi-agent deep reinforcement learning | 2018-12-21 | 2019-03-04 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN109729528B (en) |
Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108834109B | 2018-05-03 | 2021-03-19 | Army Engineering University of PLA | D2D cooperative relay power control method based on Q learning under full-duplex active eavesdropping |
Patent Citations (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104995851A | 2013-03-08 | 2015-10-21 | Intel Corporation | Distributed power control for D2D communications |
Non-Patent Citations (3)

| Title |
|---|
| Joint resource allocation and power control algorithm based on Q-learning in D2D communication; Wang Qian; Journal of Nanjing University; 2018-11-30; pp. 1183-1192 |
| Location-Aware Hypergraph Coloring Based Spectrum Allocation for D2D Communication; Zheng Li et al.; IEEE; 2018-10-15; pp. 1-6 |
| Secure Social Networks in 5G Systems with Mobile Edge Computing, Caching, and Device-to-Device Communications; Ying He et al.; IEEE; 2018-07-04; pp. 103-109 |
Also Published As

| Publication number | Publication date |
|---|---|
| CN109729528A | 2019-05-07 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |