CN115052355B - Network-assisted full duplex mode optimization method under mass terminals URLLC - Google Patents
Info
- Publication number
- CN115052355B (application CN202210649515.9A)
- Authority
- CN
- China
- Prior art keywords
- rau
- downlink
- algorithm
- uplink
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L5/00—Arrangements affording multiple use of the transmission path
- H04L5/14—Two-way operation using the same type of signal, i.e. duplex
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention relates to a network-assisted full-duplex (NAFD) mode optimization method under massive-terminal URLLC. Addressing the problem of maximizing the uplink and downlink resource utilization efficiency of network-assisted full duplex in a cell-free massive MIMO scenario, it proposes a scalable WoLF-PHC intelligent algorithm that runs in a distributed manner. Logical scalability is achieved by treating each remote antenna unit (RAU) as an agent containing a local processor that performs its own associated data-processing tasks and optimizes a local performance metric based on the decisions of the other RAUs. When a new RAU joins the system, the computing power of the central processing unit (CPU) does not need to be upgraded and the data reported by all RAUs does not need to be retrained at the CPU; distributed execution therefore makes the system algorithm scalable. The proposed distributed intelligent algorithm is better suited to the dynamic scenario of massive-terminal URLLC, has lower complexity, and requires less storage space than a conventional centralized Q-learning algorithm.
Description
Technical Field
The invention relates to a scalable duplex-mode optimization method for network-assisted full-duplex cell-free massive MIMO scenarios, applicable to ultra-reliable low-latency communication (URLLC) for massive terminals, and belongs to the technical field of mobile communication.
Background
Full-duplex (FD) technology helps improve system throughput and reduce system latency, which matters in ultra-reliable low-latency communication (URLLC) scenarios. Network-assisted full duplex (NAFD) under the cell-free distributed massive MIMO architecture is a flexible new duplex technology; the system consists of a central processing unit (CPU), multiple remote antenna units (RAUs), and multiple users. Each RAU can perform either uplink reception or downlink transmission, and the specific transmission mode is decided by the CPU. Compared with conventional time-division duplexing, NAFD can provide lower-latency service; compared with conventional frequency-division duplexing, NAFD can support asymmetric traffic without degrading spectrum utilization. NAFD flexible duplexing can support URLLC for massive terminals: by letting the CPU schedule the uplink or downlink mode of each RAU, it reduces the collision delay caused by grant-free access in URLLC and guarantees the reliability of user access; moreover, NAFD has no RAU self-interference and therefore avoids the self-interference cancellation delay of conventional full duplex.
With the explosive growth in the number of mobile terminal users, the system resource utilization problem and the reliable, fast access mechanism required by massive terminals remain to be studied. In massive-terminal URLLC scenarios, the node mode-selection mechanism of NAFD flexible duplexing must be studied jointly with URLLC-specific factors such as short-packet transmission and bit-error-rate constraints. The scalability of the algorithm demanded by massive terminals and practical systems also remains to be studied.
Disclosure of Invention
Technical problem: aiming at making the load-aware network-assisted full-duplex mode optimization technique suitable for massive-terminal URLLC so as to maximize the resource utilization of the system, the invention provides a network-assisted full-duplex mode optimization method under massive-terminal URLLC.
Technical scheme: the network-assisted full-duplex mode optimization method under massive-terminal URLLC provided by the invention is optimized with a WoLF-PHC-based intelligent algorithm and comprises the following steps:
step 1: defining a load-aware utility function for each user i:
where U_i is the load-aware utility function, used to characterize the resource utilization of the system; k is the number of allocatable resource blocks each remote antenna unit (RAU) has, K is the total number of users, n_{m,i} is the number of resource blocks allocated to user i by the m-th RAU (RAU m for short), n_{m,a} is the number of resource blocks allocated to user a by RAU m, and the summation term is the total number of resource blocks RAU m allocates to all users; n_{m,i} can be calculated by the following formula:
where the first quantity is the bandwidth user i requires according to its own quality of service (QoS), b is the bandwidth occupied by each resource block, γ_i is the signal-to-interference-plus-noise ratio (SINR) of user i, R_i is the short-packet achievable rate in the URLLC scenario, V(γ_i) is the channel dispersion of user i, m is the short-packet blocklength, Q^{-1}(·) is the inverse Q-function, ε_0 is the decoding error probability (DEP), e is the base of the natural logarithm, and ⌈·⌉ denotes rounding up;
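The closed forms of R_i and n_{m,i} are not reproduced above; since the listed symbols (V(γ_i), m, Q^{-1}(·), ε_0) match the standard short-blocklength normal approximation, a hedged reconstruction under that assumption is:

```latex
% Hedged reconstruction (assumption): normal approximation for the short-packet rate,
% with B_i^{QoS} a placeholder name for the bandwidth user i needs to meet its QoS.
R_i \approx \log_2(1+\gamma_i) - \sqrt{\tfrac{V(\gamma_i)}{m}}\,Q^{-1}(\varepsilon_0)\,\log_2 e,
\qquad V(\gamma_i) = 1 - \frac{1}{(1+\gamma_i)^2},
\qquad n_{m,i} = \Bigl\lceil \tfrac{B_i^{\mathrm{QoS}}}{b} \Bigr\rceil .
```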
Step 2: the optimization objective is to maximize the users' load-aware resource utility function:
where U_{U,i} is the load-aware utility function value of uplink user i, U_{D,j} is the load-aware utility function value of downlink user j, K_u is the number of uplink users, and K_d is the number of downlink users; the subscripts u and d identify the uplink and downlink, and i and j index the i-th uplink user and the j-th downlink user, respectively. To determine in which mode each RAU should operate, two binary allocation vectors x_u, x_d ∈ {0,1}^{M×1} are defined, with M the total number of RAUs; the entry of x_u (or x_d) corresponding to an RAU takes the value 1 if that RAU is used for the uplink (or downlink), and 0 otherwise. The effective load-aware utility function values of the uplink and downlink can be represented by formulas (5) and (6), respectively:
where X_u = diag(x_u) and X_d = diag(x_d), with diag(a) denoting the diagonal matrix formed from the vector a; M_u is the number of uplink RAUs, M_d is the number of downlink RAUs, k_{U,m} is the number of resource blocks available for allocation at uplink RAU m, k_{D,m} is the number of resource blocks available for allocation at downlink RAU m, n_{m,i} is the number of resource blocks RAU m allocates to uplink user i when that user's QoS requirement is satisfied, and n_{m,j} is the number of resource blocks RAU m allocates to downlink user j when that user's QoS requirement is satisfied;
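For readability, the mode-selection problem implied by the definitions above can be sketched as follows; the constraint that every RAU serves exactly one direction is an assumption drawn from the NAFD description, not a formula reproduced from the filing:

```latex
% Hedged sketch (assumption): each RAU is assigned to exactly one of uplink/downlink.
\max_{x_u,\,x_d \in \{0,1\}^{M\times 1}} \;
\sum_{i=1}^{K_u} U_{U,i}(X_u) + \sum_{j=1}^{K_d} U_{D,j}(X_d)
\qquad \text{s.t.}\;\; x_u + x_d = \mathbf{1}_M,\;\;
X_u = \mathrm{diag}(x_u),\;\; X_d = \mathrm{diag}(x_d).
```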
Step 3: optimize the resource utility function with the intelligent algorithm, and store the algorithm's final state set and rewards as the optimal RAU duplex modes and the maximized resource utilization efficiency.
Wherein:
The intelligent algorithm based on WoLF-PHC is as follows:
WoLF (Win or Learn Fast) means adjusting parameters cautiously and slowly when the agent is doing better than expected, and adjusting them quickly when it is doing worse than expected;
PHC (policy hill-climbing) is a learning algorithm for a single agent in a stationary environment; its core is the idea of ordinary reinforcement learning, increasing the selection probability of the action with the largest expected cumulative return, and the algorithm can converge to an optimal policy;
The WoLF-PHC intelligent algorithm is a scalable algorithm suited to distributed execution by multiple agents. It combines WoLF and PHC, so that an agent adjusts quickly to adapt to the policy changes of other agents when its rewards are worse than expected, and learns cautiously when its rewards are better than expected, giving the other agents time to adapt to its policy changes. The WoLF-PHC algorithm can converge to a Nash equilibrium policy, and when the other agents adopt fixed policies it converges to the optimal policy under the current conditions rather than to a possibly poor Nash equilibrium. The WoLF-PHC algorithm does not need to observe the policies, actions, or reward values of other agents, needs less space to record the Q values, and improves its policy by learning through the PHC algorithm, so no linear or quadratic programming is needed to solve for a Nash equilibrium, which speeds up the algorithm. In the massive-terminal URLLC scenario, distributed operation makes the algorithm logically scalable.
The WoLF-PHC algorithm maintains an average estimation policy, whose update follows the following equation:
where π_i(s,a_i) is the policy for the given state-action pair and C(s) is the number of times state s has occurred; π_i(s,a_i) is updated as follows:
where Q(s,a) is the action-value function obtained by taking action a in state s, updated according to formula (8). The policy is updated by an increment or a decrement: when the currently selected action a_i is not the action that maximizes the Q value, the policy is decreased by δ_{sa_i}; when a_i is the action that maximizes the Q value, the policy is increased accordingly. The value of δ_{sa} in turn depends on the estimation policy π_i(s,a_i); δ is an update auxiliary parameter, |A_i| is the size of the action space, and the specific value of δ is given by (5): δ_w is the positive update auxiliary parameter used when the agent's rewards are better than expected, and δ_l is the negative update auxiliary parameter used when the agent's rewards are worse than expected;
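The update rules referred to above follow the standard WoLF-PHC formulation; a hedged reconstruction using the symbols just defined is:

```latex
% Hedged reconstruction (assumption): standard WoLF-PHC updates (Bowling & Veloso),
% expressed with the symbols defined in the surrounding text.
\bar{\pi}_i(s,a) \leftarrow \bar{\pi}_i(s,a) + \frac{1}{C(s)}\bigl(\pi_i(s,a) - \bar{\pi}_i(s,a)\bigr),
\qquad
\pi_i(s,a_i) \leftarrow \pi_i(s,a_i) +
\begin{cases}
\sum_{a'\neq a_i}\delta_{s a'}, & a_i = \arg\max_{a'} Q(s,a'),\\
-\,\delta_{s a_i}, & \text{otherwise,}
\end{cases}
\\[4pt]
\delta_{s a} = \min\!\Bigl(\pi_i(s,a),\, \tfrac{\delta}{|A_i|-1}\Bigr),
\qquad
\delta =
\begin{cases}
\delta_w, & \sum_{a}\pi_i(s,a)\,Q(s,a) > \sum_{a}\bar{\pi}_i(s,a)\,Q(s,a),\\
\delta_l, & \text{otherwise.}
\end{cases}
```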
In the WoLF-PHC-based intelligent algorithm, each remote antenna unit (RAU) in the system is treated independently as an agent; data detection and node mode selection are performed locally and need not be uploaded to the central processing unit (CPU) for centralized computation. For each agent, the state space has only two states, s_t ∈ {s_1, s_2}, where s_1 indicates the RAU operates in uplink reception and s_2 indicates it operates in downlink transmission; the action space has only two actions, a_t ∈ {a_1, a_2}, where a_1 means the RAU switches its current operating mode and a_2 means it keeps the current mode unchanged. The Q table is therefore of size 2×2, and if the total number of RAUs is M, storing all Q values requires only M×2×2 entries, which is far smaller than the 2^M×M Q-table storage required for centralized processing at the CPU, with correspondingly lower complexity.
When the RAU is in uplink reception mode, the reward is given by the following formula:
When the RAU is in downlink transmission mode, the reward is given by the following formula:
the Q value is updated according to the following formula:
where α is the learning rate, s_t and a_t are the state and action at time t, and the reward R_{t+1} is the feedback the agent receives from the environment after taking action a_t in state s_t at time t; the discount factor γ defines the importance of future rewards, with a value of 0 meaning only short-term rewards are considered and a value closer to 1 placing more emphasis on long-term rewards.
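The description of this update matches the standard single-agent Q-learning rule, so a hedged reconstruction of formula (8) is:

```latex
% Hedged reconstruction (assumption): standard Q-learning update with the symbols above.
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\Bigl[R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\Bigr].
```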
The specific steps of the WoLF-PHC algorithm are as follows:
Step 1: with M the total number of RAUs, generate M all-zero Q tables of size 2×2 and initialize the channel state information, namely the channel vectors between all downlink RAUs and the downlink user j receiving the signal, the downlink precoding vectors, the channel vector G_{i,j} between the i-th uplink user and the j-th downlink user, the channels between the i-th uplink user and all uplink RAUs, and the true interference channel matrix G_I between the downlink RAUs and the uplink RAUs; initialize the learning rate α and the discount factor γ, the positive update auxiliary parameter δ_w and the negative update auxiliary parameter δ_l, the policy and the average estimation policy, where |A_i| is the size of the action space; initialize the number of occurrences of each state to C(s)=0;
Step 2: if the state of the RAU is currently uplink reception, select an action according to the policy and then compute the reward according to formula (6); if the state of the RAU is currently downlink transmission, select an action according to the policy and then compute the reward according to formula (7);
Step 3: the current state transitions to the next state according to the selected action;
Step 4: update the Q values in the Q table according to formula (8);
Step 5: for each action, update the average estimation policy according to formula (1);
Step 6: update the policy for each action according to the Q values and formulas (2)–(5);
Step 7: return to Step 2 and continue learning and training until the policy and the Q values in the Q table converge;
Step 8: return the state and reward of each agent's optimal solution, corresponding to the uplink/downlink mode of each RAU and the maximum user resource utility function value.
Beneficial effects: the invention provides a scalable duplex-mode optimization method for network-assisted full-duplex cell-free massive MIMO applicable to the massive-terminal ultra-reliable low-latency communication (URLLC) scenario, and proposes a highly scalable WoLF-PHC intelligent algorithm for multi-agent distributed operation to maximize the uplink and downlink resource utilization efficiency of network-assisted full duplex in the cell-free massive MIMO scenario. Logical scalability is achieved by treating each RAU as an agent containing a local processor, which performs its own associated data-processing tasks and optimizes a local performance metric based on the decisions of the other RAUs. When a new remote antenna unit (RAU) joins the system, the computing power of the CPU does not need to be upgraded and the data transmitted by all RAUs does not need to be retrained at the CPU; distributed execution makes the system algorithm scalable. The proposed distributed intelligent algorithm is better suited to the dynamic scenario of massive-terminal URLLC, has lower complexity, and requires less storage space than the conventional centralized Q-learning algorithm.
Drawings
The drawings show the scenario constructed in the example problem and a comparison of the resource utility function under the WoLF-PHC algorithm and other algorithms.
Fig. 1 is a position distribution diagram of uniformly distributed RAUs and randomly distributed uplink and downlink users;
FIG. 2 is a graph comparing CDFs of resource utility functions under different algorithms.
Fig. 3 is a schematic flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to examples:
Consider a cell-free massive MIMO scenario in which the M RAUs are evenly distributed within a circle of radius 600 m. The system comprises K randomly distributed users, namely K_u uplink users and K_d downlink users, all distributed randomly within a circle of radius 1000 m. Assume M = 6, K_u = 20, and K_d = 20; the specific scene layout is shown in Fig. 1. The noise power is set to −90 dBm, the uplink transmit power is 30 dBm, the downlink transmit power is 23 dBm, and the path loss is 128.1 + 37.6·log10(d) dB.
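For illustration, a minimal Python sketch of this scene setup is given below; the helper names, the even placement of RAUs on a ring, the uniform random user placement, and the distance unit (d in km) in the path-loss formula are assumptions, not details taken from the patent.

```python
import numpy as np

# Minimal sketch of the simulation scene described above (assumptions noted inline).
M, K_u, K_d = 6, 20, 20                     # RAUs, uplink users, downlink users
RAU_RADIUS, USER_RADIUS = 600.0, 1000.0     # metres

# RAUs evenly spaced on a circle of radius 600 m (one plausible reading of
# "evenly distributed"; the exact layout is not specified in the text).
angles = 2 * np.pi * np.arange(M) / M
rau_xy = RAU_RADIUS * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def random_users(n, radius, rng):
    """Users drawn uniformly over a disc of the given radius (assumption)."""
    r = radius * np.sqrt(rng.uniform(size=n))       # sqrt gives uniform area density
    theta = rng.uniform(0, 2 * np.pi, size=n)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

rng = np.random.default_rng(0)
ul_xy = random_users(K_u, USER_RADIUS, rng)
dl_xy = random_users(K_d, USER_RADIUS, rng)

def path_loss_db(d_m):
    """Path loss 128.1 + 37.6*log10(d) as stated in the text (d in km is an assumption)."""
    return 128.1 + 37.6 * np.log10(np.maximum(d_m, 1.0) / 1000.0)

# Distances from each uplink user to each RAU (metres), for later SINR computation.
d_ul = np.linalg.norm(ul_xy[:, None, :] - rau_xy[None, :, :], axis=-1)

NOISE_DBM, P_UL_DBM, P_DL_DBM = -90.0, 30.0, 23.0   # powers from the text
```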
The implementation method of the invention in the system is as follows:
(1) Defining a load-aware utility function for each user i:
where U_i is the load-aware utility function used to characterize the resource utilization of the system, k is the number of allocatable resource blocks each remote antenna unit (RAU) has, and K is the total number of users; n_{m,i} is the number of resource blocks allocated to user i by RAU m, which can be calculated by the following equation:
where the first quantity is the bandwidth user i requires according to its quality of service (QoS), b is the bandwidth occupied by each resource block, γ_i is the signal-to-interference-plus-noise ratio (SINR) of user i, R_i is the short-packet achievable rate in the URLLC scenario, V(γ_i) is the channel dispersion of user i, m is the short-packet blocklength, Q^{-1}(·) is the inverse Q-function, ε_0 is the decoding error probability (DEP), e is the base of the natural logarithm, and ⌈·⌉ denotes rounding up. By the properties of the logarithmic function, when the whole network is overloaded the user preferentially selects the RAU that needs to allocate it the fewest resource blocks, provided its quality of service can still be met, which improves the overall resource-block utilization of the system. As the overall load of the network increases, the value of the user's load-aware utility function decreases; conversely, as the number of allocatable resource blocks owned by the RAU increases, the value of the load-aware utility function increases. If an RAU cannot guarantee the quality of service of user i, it does not provide resource blocks to user i and U_i = 0. U_i can therefore be used as a load-aware utility function to characterize the resource utilization of the system.
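The closed form of U_i is not reproduced above; purely as an illustration of the monotonicity properties just described, one logarithmic form consistent with them (an assumption, not the patent's expression) would be:

```latex
% Illustrative form only (assumption): decreasing in the RAU's total load and in n_{m,i},
% increasing in the number of allocatable resource blocks k.
U_i = \log_2\!\Bigl(1 + \frac{k - \sum_{a=1}^{K} n_{m,a}}{n_{m,i}}\Bigr),
\qquad U_i = 0 \ \text{if the QoS of user } i \text{ cannot be met.}
```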
(2) The optimization objective is to maximize the user's resource utility function based on load awareness:
where U_{U,i} is the load-aware utility function value of uplink user i, U_{D,j} is the load-aware utility function value of downlink user j, K_u is the number of uplink users, and K_d is the number of downlink users; the subscripts u and d identify the uplink and downlink, and i and j index the i-th uplink user and the j-th downlink user, respectively. To determine in which mode each remote antenna unit (RAU) should operate, two binary allocation vectors x_u, x_d ∈ {0,1}^{M×1} are defined, with M the total number of RAUs; the entry of x_u (or x_d) corresponding to an RAU takes the value 1 if that RAU is used for the uplink (or downlink), and 0 otherwise. The effective load-aware utility function values of the uplink and downlink can be represented by equations (5) and (6), respectively:
where X_u = diag(x_u) and X_d = diag(x_d), with diag(a) denoting the diagonal matrix formed from the vector a; M_u is the number of uplink RAUs, M_d is the number of downlink RAUs, k_{U,m} is the number of resource blocks available for allocation at uplink RAU m, k_{D,m} is the number of resource blocks available for allocation at downlink RAU m, n_{m,i} is the number of resource blocks RAU m allocates to uplink user i when that user's QoS requirement is satisfied, and n_{m,j} is the number of resource blocks RAU m allocates to downlink user j when that user's QoS requirement is satisfied.
(3) Optimize the resource utility function with an intelligent algorithm, and store the algorithm's final state set and rewards as the optimal RAU duplex modes and the maximized resource utilization efficiency.
To realize the load-aware network-assisted full-duplex mode optimization technique in a scenario suitable for massive-terminal URLLC, a WoLF-PHC-based Q-learning intelligent algorithm is provided:
WoLF (Win or Learn Fast) means adjusting parameters cautiously and slowly when the agent is doing better than expected, and adjusting them quickly when it is doing worse than expected.
PHC (policy hill-climbing) is a learning algorithm for a single agent in a stationary environment. Its core is the idea of ordinary reinforcement learning, increasing the selection probability of the action with the largest expected cumulative return. The algorithm is rational and can converge to an optimal policy.
The WoLF-PHC intelligent algorithm is a scalable algorithm suited to distributed execution by multiple agents. It combines the WoLF and PHC algorithms, so that an agent adjusts quickly to adapt to the policy changes of other agents when its rewards are worse than expected, and learns cautiously when its rewards are better than expected, giving the other agents time to adapt to its policy changes. The WoLF-PHC algorithm can converge to a Nash equilibrium policy, and when the other agents adopt fixed policies it converges to the optimal policy under the current conditions rather than to a possibly poor Nash equilibrium. The WoLF-PHC algorithm does not need to observe the policies, actions, or reward values of other agents, needs less space to record the Q values, and improves its policy by learning through the PHC algorithm, so no linear or quadratic programming is needed to solve for a Nash equilibrium, which speeds up the algorithm. In the massive-terminal URLLC scenario, distributed operation makes the algorithm logically scalable. The average estimation policy is updated according to the following equation:
where π_i(s,a_i) is the policy for the given state-action pair and C(s) is the number of times state s has occurred. π_i(s,a_i) is updated as follows:
where Q(s,a) is the action-value function obtained by taking action a in state s, updated according to formula (14).
In the WoLF-PHC algorithm, each RAU in the system is treated independently as an agent; data detection and node mode selection are performed locally and need not be uploaded to the central processing unit (CPU) for centralized computation. For each agent, the state space has only two states, s_t ∈ {s_1, s_2}, where s_1 indicates the RAU operates in uplink reception and s_2 indicates it operates in downlink transmission; the action space has only two actions, a_t ∈ {a_1, a_2}, where a_1 means the RAU switches its current operating mode and a_2 means it keeps the current mode unchanged. The Q table is therefore of size 2×2, and if the total number of RAUs is M, storing all Q values requires only M×2×2 entries, far smaller than the 2^M×M Q-table storage required for centralized processing at the CPU, with correspondingly lower complexity. When the RAU is in uplink reception mode, the reward is given by the following formula:
When the RAU is in downlink transmission mode, the reward is given by the following equation:
the update formula of the Q value is as follows:
where α is the learning rate, s_t and a_t are the state and action at time t, and the reward R_{t+1} is the feedback the agent receives from the environment after taking action a_t in state s_t at time t; the discount factor γ defines the importance of future rewards, with a value of 0 meaning only short-term rewards are considered and a value closer to 1 placing more emphasis on long-term rewards. The specific steps of the algorithm are as follows (a Python sketch of one agent's update loop is given after the listed steps):
① With M the total number of RAUs, generate M all-zero Q tables of size 2×2 and initialize the channel state information, namely the channel vectors between all downlink RAUs and the downlink user j receiving the signal, the downlink precoding vectors, the channel vector G_{i,j} between the i-th uplink user and the j-th downlink user, the channels between the i-th uplink user and all uplink RAUs, and the true interference channel matrix G_I between the downlink RAUs and the uplink RAUs; initialize the policy and the average estimation policy, where |A_i| is the size of the action space, and initialize C(s)=0;
② If the state of the RAU is currently uplink reception, select an action according to the policy and then compute the reward according to formula (12); if the state of the RAU is currently downlink transmission, select an action according to the policy and then compute the reward according to formula (13);
③ The current state transitions to the next state according to the selected action;
④ Update the Q values in the Q table according to formula (14);
⑤ For each action, update the average estimation policy according to formula (7);
⑥ Update the policy for each action according to the Q values and formulas (8)–(11);
⑦ Return to step ② and continue learning and training until the policy and the Q values in the Q table converge;
⑧ Return the state and reward of each agent's optimal solution, corresponding to the uplink/downlink mode of each RAU and the maximum user resource utility function value.
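As a concrete reading of steps ①–⑧, a compact Python sketch of one RAU agent's WoLF-PHC update loop is given below. It is a hedged illustration: the hyperparameter values are assumptions, and the reward computation is stubbed because the closed forms of formulas (12) and (13) are not reproduced here.

```python
import numpy as np

class WoLFPHCAgent:
    """One RAU agent: 2 states (UL/DL mode) x 2 actions (switch/keep), per the text."""
    def __init__(self, alpha=0.1, gamma=0.9, delta_w=0.01, delta_l=0.04, rng=None):
        self.Q = np.zeros((2, 2))              # 2x2 Q table per agent
        self.pi = np.full((2, 2), 0.5)         # policy, initialised to 1/|A_i|
        self.pi_bar = np.full((2, 2), 0.5)     # average (estimation) policy
        self.C = np.zeros(2)                   # visit counts C(s)
        self.alpha, self.gamma = alpha, gamma
        self.delta_w, self.delta_l = delta_w, delta_l
        self.rng = rng or np.random.default_rng()

    def act(self, s):
        return self.rng.choice(2, p=self.pi[s])

    def update(self, s, a, reward, s_next):
        # Q-learning update (hedged reconstruction of the Q-value formula).
        self.Q[s, a] += self.alpha * (reward + self.gamma * self.Q[s_next].max() - self.Q[s, a])
        # Average-policy update.
        self.C[s] += 1
        self.pi_bar[s] += (self.pi[s] - self.pi_bar[s]) / self.C[s]
        # WoLF step size: cautious (delta_w) when winning, fast (delta_l) when losing.
        winning = self.pi[s] @ self.Q[s] > self.pi_bar[s] @ self.Q[s]
        delta = self.delta_w if winning else self.delta_l
        d_sa = np.minimum(self.pi[s], delta / (2 - 1))   # |A_i| = 2
        best = int(np.argmax(self.Q[s]))
        for a_i in range(2):
            if a_i == best:
                self.pi[s, a_i] += d_sa[np.arange(2) != a_i].sum()
            else:
                self.pi[s, a_i] -= d_sa[a_i]
        self.pi[s] = np.clip(self.pi[s], 0.0, 1.0)
        self.pi[s] /= self.pi[s].sum()          # keep a valid probability distribution

def train(agents, compute_reward, episodes=1000):
    """Distributed learning over M agents; compute_reward stands in for formulas (12)/(13)."""
    states = [0] * len(agents)                  # 0 = uplink reception, 1 = downlink transmission
    for _ in range(episodes):
        for m, ag in enumerate(agents):
            a = ag.act(states[m])
            s_next = 1 - states[m] if a == 0 else states[m]   # action 0: switch mode, 1: keep
            r = compute_reward(m, s_next, states)              # placeholder reward (assumption)
            ag.update(states[m], a, r, s_next)
            states[m] = s_next
    return states                               # converged uplink/downlink mode of each RAU
```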
Fig. 2 shows that the scalable network-assisted full-duplex cell-free massive MIMO scheme provided by the invention for the massive-terminal ultra-reliable low-latency communication (URLLC) scenario achieves a higher resource utility than the fixed-mode scheme that splits the RAUs equally between uplink and downlink and the time-division duplex (TDD) scheme, and comes close to the exhaustive-search algorithm that is theoretically optimal; the performance at the network-system level is slightly below exhaustive search because in some cases the algorithm converges to a Nash equilibrium policy. The complexity of the proposed algorithm is, however, much lower than that of exhaustive search, and compared with the centralized Q-learning scheme it needs less computation and storage and offers higher scalability, making it better suited to the massive-terminal URLLC scenario.
Claims (6)
1. A network-assisted full-duplex mode optimization method under massive-terminal URLLC, characterized in that the method is optimized with a WoLF-PHC-based intelligent algorithm and comprises the following steps:
step 1: defining a load-aware utility function for each user i:
where U_i is the load-aware utility function, used to characterize the resource utilization of the system; k is the number of allocatable resource blocks each remote antenna unit (RAU) has, K is the total number of users, n_{m,i} is the number of resource blocks allocated to user i by the m-th RAU (RAU m for short), n_{m,a} is the number of resource blocks allocated to user a by RAU m, and the summation term is the total number of resource blocks RAU m allocates to all users; n_{m,i} can be calculated by the following formula:
where the first quantity is the bandwidth user i requires according to its own quality of service (QoS), b is the bandwidth occupied by each resource block, γ_i is the signal-to-interference-plus-noise ratio (SINR) of user i, R_i is the short-packet achievable rate in the URLLC scenario, V(γ_i) is the channel dispersion of user i, m is the short-packet blocklength, Q^{-1}(·) is the inverse Q-function, ε_0 is the decoding error probability (DEP), e is the base of the natural logarithm, and ⌈·⌉ denotes rounding up;
Step 2: the optimization objective is to maximize the users' load-aware resource utility function:
where U_{U,i} is the load-aware utility function value of uplink user i, U_{D,j} is the load-aware utility function value of downlink user j, K_u is the number of uplink users, and K_d is the number of downlink users; the subscripts u and d identify the uplink and downlink, and i and j index the i-th uplink user and the j-th downlink user, respectively. To determine in which mode each remote antenna unit RAU should operate, two binary allocation vectors x_u, x_d ∈ {0,1}^{M×1} are defined, with M the total number of RAUs; the entry of x_u (or x_d) corresponding to an RAU takes the value 1 if that RAU is used for the uplink (or downlink), and 0 otherwise. The effective load-aware utility function values of the uplink and downlink can be represented by formulas (5) and (6), respectively:
where X_u = diag(x_u) and X_d = diag(x_d), with diag(a) denoting the diagonal matrix formed from the vector a; M_u is the number of uplink RAUs, M_d is the number of downlink RAUs, k_{U,m} is the number of resource blocks available for allocation at uplink RAU m, k_{D,m} is the number of resource blocks available for allocation at downlink RAU m, n_{m,i} is the number of resource blocks RAU m allocates to uplink user i when that user's QoS requirement is satisfied, and n_{m,j} is the number of resource blocks RAU m allocates to downlink user j when that user's QoS requirement is satisfied;
Step 3: optimize the resource utility function with the intelligent algorithm, and store the algorithm's final state set and rewards as the optimal RAU duplex modes and the maximized resource utilization efficiency.
2. The network-assisted full-duplex mode optimization method under massive-terminal URLLC according to claim 1, wherein the WoLF-PHC-based intelligent algorithm is as follows:
WoLF (Win or Learn Fast) means adjusting parameters cautiously and slowly when the agent is doing better than expected, and adjusting them quickly when it is doing worse than expected;
PHC (policy hill-climbing) is a learning algorithm for a single agent in a stationary environment; its core is the idea of ordinary reinforcement learning, increasing the selection probability of the action with the largest expected cumulative return, and the algorithm can converge to an optimal policy;
The WoLF-PHC intelligent algorithm is a scalable algorithm suited to distributed execution by multiple agents. It combines WoLF and PHC, so that an agent adjusts quickly to adapt to the policy changes of other agents when its rewards are worse than expected, and learns cautiously when its rewards are better than expected, giving the other agents time to adapt to its policy changes. The WoLF-PHC algorithm can converge to a Nash equilibrium policy, and when the other agents adopt fixed policies it converges to the optimal policy under the current conditions rather than to a possibly poor Nash equilibrium. The WoLF-PHC algorithm does not need to observe the policies, actions, or reward values of other agents, needs less space to record the Q values, and improves its policy by learning through the PHC algorithm, so no linear or quadratic programming is needed to solve for a Nash equilibrium, which speeds up the algorithm. In the massive-terminal URLLC scenario, distributed operation makes the algorithm logically scalable.
3. The network-assisted full-duplex mode optimization method under massive-terminal URLLC according to claim 2, wherein the WoLF-PHC algorithm maintains an average estimation policy whose update follows the following equation:
where π_i(s,a_i) is the policy for the given state-action pair and C(s) is the number of times state s has occurred; π_i(s,a_i) is updated as follows:
where Q(s,a) is the action-value function obtained by taking action a in state s, updated according to formula (8). The policy is updated by an increment or a decrement: when the currently selected action a_i is not the action that maximizes the Q value, the policy is decreased by δ_{sa_i}; when a_i is the action that maximizes the Q value, the policy is increased accordingly. The value of δ_{sa} in turn depends on the estimation policy π_i(s,a_i); δ is an update auxiliary parameter, |A_i| is the size of the action space, and the specific value of δ is given by (5): δ_w is the positive update auxiliary parameter used when the agent's rewards are better than expected, and δ_l is the negative update auxiliary parameter used when the agent's rewards are worse than expected;
In the WoLF-PHC-based intelligent algorithm, each remote antenna unit (RAU) in the system is treated independently as an agent; data detection and node mode selection are performed locally and need not be uploaded to the central processing unit (CPU) for centralized computation. For each agent, the state space has only two states, s_t ∈ {s_1, s_2}, where s_1 indicates the RAU operates in uplink reception and s_2 indicates it operates in downlink transmission; the action space has only two actions, a_t ∈ {a_1, a_2}, where a_1 means the RAU switches its current operating mode and a_2 means it keeps the current mode unchanged. The Q table is therefore of size 2×2, and if the total number of RAUs is M, storing all Q values requires only M×2×2 entries, which is far smaller than the 2^M×M Q-table storage required for centralized processing at the CPU, with correspondingly lower complexity.
4. The network-assisted full-duplex mode optimization method under massive-terminal URLLC according to claim 3, wherein when the RAU is in uplink reception mode the reward is given by the following formula:
and when the RAU is in downlink transmission mode the reward is given by the following formula:
5. The network-assisted full-duplex mode optimization method under massive-terminal URLLC according to claim 3, wherein the Q value is updated according to the following formula:
where α is the learning rate, s_t and a_t are the state and action at time t, and the reward R_{t+1} is the feedback the agent receives from the environment after taking action a_t in state s_t at time t; the discount factor γ defines the importance of future rewards, with a value of 0 meaning only short-term rewards are considered and a value closer to 1 placing more emphasis on long-term rewards.
6. The network-assisted full-duplex mode optimization method under massive-terminal URLLC according to claim 3, wherein the specific steps of the WoLF-PHC algorithm are as follows:
Step 1: with M the total number of RAUs, generate M all-zero Q tables of size 2×2 and initialize the channel state information, namely the channel vectors between all downlink RAUs and the downlink user j receiving the signal, the downlink precoding vectors, the channel vector G_{i,j} between the i-th uplink user and the j-th downlink user, the channels between the i-th uplink user and all uplink RAUs, and the true interference channel matrix G_I between the downlink RAUs and the uplink RAUs; initialize the learning rate α and the discount factor γ, the positive update auxiliary parameter δ_w and the negative update auxiliary parameter δ_l, the policy and the average estimation policy, where |A_i| is the size of the action space; initialize the number of occurrences of each state to C(s)=0;
Step 2: if the state of the RAU is currently uplink reception, select an action according to the policy and then compute the reward according to formula (6); if the state of the RAU is currently downlink transmission, select an action according to the policy and then compute the reward according to formula (7);
Step 3: the current state transitions to the next state according to the selected action;
Step 4: update the Q values in the Q table according to formula (8);
Step 5: for each action, update the average estimation policy according to formula (1);
Step 6: update the policy for each action according to the Q values and formulas (2)–(5);
Step 7: return to Step 2 and continue learning and training until the policy and the Q values in the Q table converge;
Step 8: return the state and reward of each agent's optimal solution, corresponding to the uplink/downlink mode of each RAU and the maximum user resource utility function value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210649515.9A CN115052355B (en) | 2022-06-09 | 2022-06-09 | Network-assisted full duplex mode optimization method under mass terminals URLLC |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210649515.9A CN115052355B (en) | 2022-06-09 | 2022-06-09 | Network-assisted full duplex mode optimization method under mass terminals URLLC |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115052355A CN115052355A (en) | 2022-09-13 |
CN115052355B (en) | 2024-07-05
Family
ID=83161112
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210649515.9A Active CN115052355B (en) | 2022-06-09 | 2022-06-09 | Network-assisted full duplex mode optimization method under mass terminals URLLC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115052355B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110248402A (en) * | 2018-03-09 | 2019-09-17 | 华为技术有限公司 | A kind of Poewr control method and equipment |
CN110267338A (en) * | 2019-07-08 | 2019-09-20 | 西安电子科技大学 | Federated resource distribution and Poewr control method in a kind of D2D communication |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10452440B1 (en) * | 2016-06-07 | 2019-10-22 | PC Drivers Headquarters, Inc. | Systems and methods of optimized tuning of resources |
ES2975279T3 (en) * | 2017-09-28 | 2024-07-04 | Zte Corp | Method and systems for exchanging messages on a wireless network |
US11012112B2 (en) * | 2018-02-09 | 2021-05-18 | Qualcomm Incorporated | Techniques for flexible resource allocation |
CN114258138B (en) * | 2021-12-20 | 2024-07-05 | 东南大学 | Network-assisted full duplex mode optimization method based on load perception |
Also Published As
Publication number | Publication date |
---|---|
CN115052355A (en) | 2022-09-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |