Article

Downlink Non-Orthogonal Multiple Access Power Allocation Algorithm Based on Double Deep Q Network for Ensuring User’s Quality of Service

School of Computer and Communication, Lanzhou University of Technology, No. 36, Pengjiaping Road, Qilihe District, Lanzhou 730050, China
*
Author to whom correspondence should be addressed.
Symmetry 2024, 16(12), 1613; https://doi.org/10.3390/sym16121613
Submission received: 11 November 2024 / Revised: 27 November 2024 / Accepted: 30 November 2024 / Published: 5 December 2024
(This article belongs to the Section Computer)

Abstract

Non-orthogonal multiple access (NOMA) provides higher spectral efficiency and access to more users than orthogonal multiple access. However, resource allocation in NOMA is dynamic and imposes a high computational burden on traditional methods. In this paper, a symmetry-aware double deep Q network (DDQN) algorithm from deep reinforcement learning is employed to allocate power to users in NOMA while guaranteeing quality of service for the weakest users. The research is divided into two parts. First, users in the communication system are grouped using a method that jointly considers gain difference and similarity, exploiting symmetrical properties within the user groups. Second, the DDQN algorithm is used to allocate power to multiple users in the NOMA system, using the inherent symmetry in each user's signal-to-interference noise ratio as the objective function. By recognizing and leveraging these symmetrical patterns, the algorithm can dynamically adjust the power allocation to optimize system performance. Finally, the proposed algorithm is compared with conventional NOMA power allocation algorithms and shows significant improvements in system performance. The convergence results show that the proposed algorithm converges in approximately 1800 iterations, which effectively avoids the heavy computation and complex procedures of traditional methods.

1. Introduction

The fifth generation wireless system (5G) is expected to reduce latency and improve spectral efficiency [1]. Traditional orthogonal multiple access (OMA) is limited by its orthogonal resources and cannot support a large number of users. These challenges can be addressed by non-orthogonal multiple access (NOMA) technology. Recently, NOMA has attracted considerable attention in systems such as the internet of things (IoT), telematics, heterogeneous networks, signal detection, and edge computing [2,3,4,5,6]. In NOMA, multiple users are transmitted on the same block of time and frequency resources, and the users are distinguished by different power levels, which accommodates more users and improves resource utilization. NOMA uses superposition coding to superpose users at the transmitter and successive interference cancellation (SIC) to demodulate them at the receiver. Although NOMA has many advantages, resource allocation in NOMA is an NP-hard problem [7,8] with high computational complexity that requires substantial computational resources. It is worth noting that intelligent algorithms can reduce this complexity [9]. Many scholars have adopted reinforcement learning models to solve dynamic resource allocation problems and have achieved excellent results [10,11]. Most of these studies optimize the system sum rate [12], which may leave some users with suboptimal performance. Therefore, building on previous work, this paper employs the Double Deep Q-Network (DDQN) algorithm to address this NP-hard problem, taking the quality of service (QoS) of the worst-case users in the NOMA communication system as the optimization objective.
User grouping is the stage preceding power allocation, and a reasonable grouping of users can also enhance channel capacity. A review of the work related to NOMA user grouping reveals several instructive results. For instance, reference [13] sorts users into two queues according to their gain ordering and pairs the users in the two queues one by one for transmission to solve the NOMA user grouping problem. While this grouping method fully accounts for the effect of gain differences, it ignores interference between users. Reference [14] implements user grouping by pairing the user with the highest channel gain with the user with the worst channel gain; this method does not take the similarity between users into account. Therefore, we consider these two aspects jointly, approximating the channel gain difference with the Euclidean distance and the user similarity with the cosine distance, so as to balance the complexity of the SIC receivers against the similarity between users.
After user grouping, power allocation is required. Power allocation methods fall into two categories: traditional mathematical methods and intelligent algorithms. References [15,16] investigate NOMA power allocation with conventional methods; however, because power allocation in NOMA is dynamic, traditional methods are not well suited to such problems [17]. Many scholars have therefore used machine learning (ML). Reference [18] proposes a lightweight neural network that optimizes beamforming and time allocation to mitigate blockages caused by human movement in indoor hybrid visible light/RF communication systems; its use of neural networks for allocation is instructive. Reference [19] combines convex optimization and machine learning to solve the joint power allocation and channel assignment problem in uplink NOMA networks, effectively improving resource allocation efficiency and system performance. These works show that ML methods can better address such problems and adapt more effectively to the complexities of future wireless networks. Reinforcement learning (RL) methods can handle dynamic optimization problems by giving the agent the ability to learn. Reference [20] presents a collaborative Q-learning approach for a downlink NOMA scenario with a multi-antenna base station (BS), in which each antenna is assigned a user with its own Q-table; this method cannot be applied to scenarios with a large number of users because of the limited capacity of the Q-table. Reference [21] proposes a Q-learning-based resource allocation scheme to improve the quality of experience (QoE) of NOMA users in downlink wireless multimedia communications: the optimal resource allocation policy for the BS is obtained by using the QoE utility as the agent's return in the Q-learning process, which in turn improves QoS. Setting the QoE utility as the reward can indirectly improve QoS, but it may not directly and comprehensively optimize the QoS of all users, especially when user demands are diverse. Reference [22] adopts a max-min rate approach: the user with the smallest rate is continuously identified and its SINR optimized, thereby raising the overall channel rate and performance of the communication system. That work is based on Q-learning, which converges slowly in some complex environments and may require a large number of training steps to find the optimal strategy, a poor fit for rapidly changing communication environments. Reference [23] investigates the combined optimization of power allocation and dynamic user pairing for downlink multicarrier NOMA systems with the objective of maximizing the sum rate of the whole system; focusing only on overall performance, however, may leave users with very poor communication quality in the cell unoptimized.
While the aforementioned methods perform well in many situations, they also have limitations. Algorithms based on Q-tables are limited by the table's capacity and cannot handle scenarios with large state spaces. Q-learning may converge slowly in complex environments, requiring a large number of training steps to find the optimal strategy. Methods that focus only on maximizing the overall rate may leave some users with poor performance. To address these issues, we propose a DDQN-based method. DDQN can deal with complex problems involving large state and action spaces. Using DDQN for power allocation can effectively address the high computational complexity of traditional resource allocation algorithms and significantly improve convergence speed and training stability. In the NOMA system, the DDQN algorithm treats the base station as the agent and the power allocation factors assigned by the base station to users as actions. Through continuous iteration, the value network converges, achieving balanced QoS for the users within the cell and thereby an optimal resource allocation. The contributions of this paper are as follows:
  • Both the gain difference and the similarity of the user channels are considered in the grouping process. We construct a gain-difference matrix and a similarity matrix, add them, and normalize the result to obtain a single matrix that carries both kinds of information. User matching is performed on this combined matrix, which reduces receiver complexity and inter-user interference (a small sketch of this construction follows the list below).
  • We use the signal-to-interference noise ratio (SINR) of the poorest user as the optimization objective. The base station's transmit power allocation factor is used as the action, the improvement of the SINR is used as the reward, and a violation of the communication model is penalized. Power allocation is then performed using a DDQN-based approach.
  • The algorithm in this paper ensures user fairness and efficiency. The simulation results show that the channel rates of multiple users gradually approach each other over the iterations. The algorithm can thus balance communication quality and guarantee basic communication for users in an emergency.
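As a concrete illustration of the first contribution, the following Python sketch builds the gain-difference matrix (pairwise Euclidean distance) and the similarity matrix (pairwise cosine distance), normalizes them, and adds them. The per-user channel feature vectors `H`, the min-max normalization, and the greedy pairing rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def combined_grouping_matrix(H):
    """Add a normalized gain-difference matrix (Euclidean distance) to a
    normalized similarity matrix (cosine distance), as described in the text.

    H: (N, d) array of per-user channel feature vectors (illustrative input).
    """
    diff = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)    # gain-difference matrix
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    cos_dist = 1.0 - (H @ H.T) / (norms @ norms.T + 1e-12)           # cosine-distance matrix

    def minmax(M):                                                   # scale each matrix to [0, 1]
        return (M - M.min()) / (M.max() - M.min() + 1e-12)

    return minmax(diff) + minmax(cos_dist)

def greedy_pairs(score):
    """Illustrative greedy pairing on the combined matrix (largest score first)."""
    free, pairs = set(range(score.shape[0])), []
    while len(free) > 1:
        i, j = max(((i, j) for i in free for j in free if i < j),
                   key=lambda p: score[p])
        pairs.append((i, j))
        free -= {i, j}
    return pairs

# Example: four users with 3-dimensional channel feature vectors.
H = np.random.randn(4, 3)
pairs = greedy_pairs(combined_grouping_matrix(H))
```

Whether the pairing rule maximizes or minimizes the combined score follows the paper's criterion; the sketch only shows how the two matrices are fused.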
The remainder of this paper is organized as follows: Section 2 describes the downlink NOMA communication system model. The introduction of DDQN and the specific details of power allocation are presented in Section 3. Section 4 analyzes and discusses the experimental results. Section 5 presents the main conclusions.

2. System Model

In the NOMA communication system, the BS superimposes users' signals with different power levels and transmits them on the same frequency, which improves spectral efficiency and overall system performance. We assume that the BS is located at the center of the cell and that the users are randomly distributed within its coverage area, as shown in Figure 1. Let the set of users be denoted as $U = \{u_1, u_2, \ldots, u_i, \ldots, u_N\}$, where N is the total number of users in the cell. The symbol $s_i$ denotes the useful signal of $u_i$, and $P_{total}$ denotes the total transmit power allocated by the BS to all users in the cell. The power allocation factor is denoted by $\alpha_i$, so the power allocated to each user can be expressed as $p_i = \alpha_i P_{total}$. The power allocation factor $\alpha_i$ is a positive constant for every user in the cell, and the factors of all users satisfy $\sum_{i=1}^{N} \alpha_i = 1$.
Then, the transmitted signal x in the cell can be expressed as follows:
$x = \sum_{i=1}^{N} p_i s_i$ (1)
The received signal $y_i$ of user $u_i$ can be written as follows:
$y_i = h_i x_i + \underline{h_i x_{j \neq i}} = h_i p_i s_i + \underline{h_i \sum_{j=1, j \neq i}^{N} p_j s_j + n_i}$, (2)
where the first term denotes the useful signal received by $u_i$, and the underlined part denotes the interference in the channel, which contains both the interfering signals and the channel noise. $n_i$ denotes additive white Gaussian noise with zero mean and variance $\sigma^2$, and $h_i$ denotes the channel coefficient from the BS to the user. This paper assumes that $h_i > h_{i+1}$, $i \in \{1, 2, \ldots, N-1\}$. The receiver uses SIC technology to decode the users' signals. The SIC process requires a user's transmit power to be inversely related to its channel quality, i.e., $\alpha_i < \alpha_{i+1}$, $i \in \{1, 2, \ldots, N-1\}$. The user with the highest power is decoded first and subtracted from the received signal; then the next user's signal is decoded in turn. This process repeats until the last signal is decoded completely. Under SIC, the user's SINR is given by the following formula:
$\gamma_i = \dfrac{|h_i|^2 \alpha_i P_{total} s_i}{|h_i|^2 \sum_{j=1}^{i-1} \alpha_j P_{total} s_j + \sigma^2}$ (3)
Assuming that the BS bandwidth is B and there are M sub-channels in the cell, the bandwidth of each sub-channel is $B/M$. The number of users contained within each sub-channel is N. According to Shannon's formula, the rate of the i-th user in the z-th sub-channel can be expressed as follows:
$c_i^z = \dfrac{B}{M}\log(1+\gamma_i) = \dfrac{B}{M}\log\!\left(1+\dfrac{|h_i|^2 \alpha_i P_{total} s_i}{|h_i|^2 \sum_{j=1}^{i-1} \alpha_j P_{total} s_j + \sigma^2}\right)$, (4)
where the signal of the interfering user is denoted by the index j. The sum rate of the system can be expressed as the sum of the rates of all users in all sub-channels. This is shown in the following formula:
$C = \sum_{z=1}^{M} \sum_{i=1}^{N/M} c_i^z$ (5)
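To make the model concrete, the following Python sketch evaluates the SINR, per-user rate, and sum rate of Eqs. (3)-(5) for a single sub-channel. It assumes unit-power symbols (so the $s_i$ terms are taken as 1) and a base-2 logarithm; all numerical values are placeholders.

```python
import numpy as np

def user_rates(h, alpha, P_total, B, M, sigma2):
    """Per-user SINR and rate on one NOMA sub-channel, following Eqs. (3)-(4)."""
    h, alpha = np.asarray(h, dtype=float), np.asarray(alpha, dtype=float)
    gains = np.abs(h) ** 2
    sinr = np.empty_like(gains)
    for i in range(len(h)):
        # Residual interference comes from users j = 1..i-1 (decoded after user i), Eq. (3).
        interference = gains[i] * alpha[:i].sum() * P_total
        sinr[i] = gains[i] * alpha[i] * P_total / (interference + sigma2)
    rate = (B / M) * np.log2(1.0 + sinr)          # Eq. (4); base-2 log assumed
    return sinr, rate

# Example: two users on one sub-channel; the weaker user gets the larger power share.
sinr, rate = user_rates(h=[1.0, 0.3], alpha=[0.2, 0.8],
                        P_total=3.0, B=1e6, M=1, sigma2=1e-3)
sum_rate = rate.sum()                             # Eq. (5), restricted to one sub-channel
```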

3. DDQN and Power Allocation

3.1. DDQN Algorithm

The DDQN algorithm is efficient and can handle complex problems; therefore, it is selected for power allocation in this paper. In DDQN, a neural network is used to approximate the optimal action-value function, $Q(s,a|\theta) \approx Q^*(s,a)$, where $\theta$ denotes the network parameters, which are adjusted at each iteration to make the output more accurate. DDQN updates the network parameters with the temporal difference (TD) algorithm [24]. Unlike DQN, which uses a single neural network to estimate Q-values and can therefore suffer from overestimation errors, DDQN uses two separate networks: one to select the best action and another, called the target network, to evaluate its Q-value. This reduces the bias, results in more stable learning, and improves the overall performance of the algorithm.
In NOMA communication systems, overestimation of the value function can lead to an overestimation of the power allocation factor output by the base station, resulting in overly large adjustment steps. Specifically, the power allocated to weaker users increases too much in one adjustment while the power allocated to stronger users decreases too little. To avoid a decline in the precision of the algorithm, this overestimation must be addressed; therefore, DDQN is used in this paper. The NOMA power allocation problem involves multiple variables, including channel conditions, SINR, and power allocation factors, which make the decision-making process complex. Compared with traditional Q-learning and DQN, the dual-network structure of DDQN handles such complex, multi-variable environments better, allowing more accurate learning and optimization of the allocation strategy. The communication model in this paper assumes that channel conditions remain unchanged over a short period, which requires the algorithm to converge quickly and make stable decisions. DDQN excels in this respect, converging to the optimal strategy faster and maintaining higher stability during training. The dual-network architecture of DDQN also makes it more efficient at processing data in distributed environments. In the future, it can be combined with distributed learning methods to handle power allocation in large-scale NOMA systems more efficiently; it is also possible to build a multi-agent system in which each agent is responsible for power allocation for a different user group, working collaboratively to improve overall performance. $\hat{Q}$ represents the TD target, and it is calculated as follows:
$\hat{Q} = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1} \mid \theta^-)$, (6)
where $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1} \mid \theta^-)$ is the output of the target network, $r_t$ is the reward, and $\gamma$ is the discount factor. The discount factor decreases as the iterations progress in order to balance the rewards between the current and later moments. The loss function is defined as follows:
$L = Q(s_t, a_t \mid \theta) - \hat{Q}$, (7)
where $Q(s_t, a_t \mid \theta)$ is the output of the value network. The update formula for the parameter $\theta$ of the action-value network is defined as follows:
$\theta \leftarrow \theta - \alpha \cdot L \cdot \dfrac{\partial Q(s_t, a_t \mid \theta)}{\partial \theta}$ (8)
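A minimal PyTorch sketch of this update follows. It selects the next action with the value network and evaluates it with the target network, as described above, and it uses a mean-squared TD error as the trainable loss corresponding to Eqs. (6)-(8); the network architecture and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP mapping the state (all users' SINR) to one Q-value per action."""
    def __init__(self, n_states, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def ddqn_update(value_net, target_net, optimizer, batch, gamma):
    """One parameter update: the value network selects a_{t+1}, the target network evaluates it."""
    s, a, r, s_next = batch                                   # shapes: (B, N), (B,), (B,), (B, N)
    q_sa = value_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t | theta)
    with torch.no_grad():
        a_next = value_net(s_next).argmax(dim=1, keepdim=True)    # action selection
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)  # action evaluation
        td_target = r + gamma * q_next                            # Q_hat, cf. Eq. (6)
    loss = nn.functional.mse_loss(q_sa, td_target)            # squared TD error, cf. Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # gradient step on theta, cf. Eq. (8)
    return loss.item()
```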

3.2. Power Allocation Algorithm

The goal of this paper is to ensure basic communication quality for all users while maximizing user rates. The system rate is enhanced by improving the user with the worst QoS in the system. Accordingly, the objective function can be expressed as maximizing the minimum user channel rate, defined as follows:
$\max \min_{i \in U} \dfrac{B}{M}\log(1+\gamma_i)$ (9a)
s.t. $\sum_{i=1}^{N} \alpha_i = 1$, (9b)
$c_i > c_{th}, \; \forall i \in U$, (9c)
$h_i > h_{i+1}, \; \forall i \in U$. (9d)
The constraint, Equation (9b), ensures that the total user power does not exceed the power budget. Constraint Equation (9c) limits the minimum transmission rate of the users to ensure that the users do not generate interruptions during the transmission process, and constraint Equation (9d) ensures that the decoding of the SIC is carried out successfully. Since the logarithmic function is an increasing function, the objective function in this paper can be simplified to the maximization of the user’s SINR as in the following formula:
$\max \dfrac{|h_i|^2 \alpha_i P_{total} s_i}{|h_i|^2 \sum_{j=1}^{i-1} \alpha_j P_{total} s_j + \sigma^2}$ (10a)
s.t. $\sum_{i=1}^{N} \alpha_i = 1$, (10b)
$c_i > c_{th}, \quad i = 1, \ldots, N$, (10c)
$\alpha_i > 0, \quad i = 1, \ldots, N$. (10d)
Maximizing the overall QoS of the system is achieved by maximizing the SINR of the weaker users in the system. The DDQN algorithm consists of four main components: Agent, states, actions, and rewards. In order to solve the power allocation problem, this paper defines the agent, state space, action space, and reward function according to the reinforcement learning process.
  • Agent: In this paper, the BS is considered as the agent. In the downlink NOMA, the BS is responsible for acquiring the channel conditions. The BS allocates power to each user and sends out the signal with the allocated power;
  • State space: The optimization objective of this paper can be redefined as improving the SINR of the user with the worst channel conditions in the NOMA downlink communication system. Since the worst user is not fixed, this paper takes the SINR of all users as the state, denoted $S = \{\gamma_1, \gamma_2, \ldots, \gamma_N\}$. By identifying the symmetrical properties of users' SINR under similar conditions, the algorithm can reduce the size of the state space and simplify its complexity. At the beginning of the algorithm, the users' SINR is calculated from their initial random state and used as the input to the neural network. Executing an action changes the SINR, and the system then moves to the next state;
  • Action space: The action of the DDQN is to change the value of a power allocation factor. The continuous problem is quantized into a discrete one by setting a fixed variation value, which reduces the complexity and convergence time of the algorithm. The power used in this paper is continuous from 0 to $P_{total}$, and $\xi$ acts as a regulation factor. When the power allocation factor of one user increases by $\xi$, the power allocation factors of the remaining users each decrease by $\xi/(N-1)$. This can be seen as a symmetrical operation, since adjusting one factor affects all the others, and it guarantees constraint (10b), i.e., that the power allocation factors of all users sum to 1.
  • Reward function: In reinforcement learning, the reward function plays an important role. The reward function ensures that the agent performs better actions. Better actions can help achieve the objective function, and poor actions can make the objective function more difficult to achieve. Therefore, the following points need to be considered: Firstly, the channel gap between users needs to be decreased. This gap is expressed in two ways: the gap between user rates in a subchannel as well as the gap between the maximum and minimum user rates in the whole system. This process tries to achieve similar performance for all users in the NOMA communication system. Thus it is necessary to reduce the power allocation factor of the strong users and compensate for the weak users. The agent takes the variance of the user’s SINR as a reward. The action space prevents the agent from violating constraint (10b) during exploration. The communication model ensures that the agent does not violate constraint (10c). The reward function penalizes actions that violate constraint (10d). Thus, the reward function can be expressed as follows:
    $reward = -\Delta_{sys} - \Delta_{channel} + \beta$, (11)
where $\Delta_{sys}$ denotes the rate difference across the whole system, $\Delta_{channel}$ denotes the rate difference within a sub-channel, and $\beta$ is a penalty term. To avoid $\alpha_i < 0$, we set $\beta$ to a negative integer. In summary, the algorithm flow is shown in Algorithm 1:
Algorithm 1 NOMA power allocation algorithm based on DDQN
Input: learning rate α; discount factor γ; greedy factor ϵ; neural network parameters θ; batch size k; amount of change in action Δa; state space S
Output: θ (after training)
Initialization: initialize the experience pool D with capacity N; set the total number of rounds I_total; initialize the action space
 1: for i = 1 to I_total do
 2:   Initialize state s_t;
 3:   for all steps of the episode do
 4:     if random.uniform() < ϵ then
 5:       Randomly choose action a_t from the action space;
 6:     else
 7:       Choose the action a_t with the highest action value;
 8:     end if
 9:     Execute action a_t and obtain reward r_t and next state s_{t+1};
10:     Store transition trajectory (s_t, a_t, r_t, s_{t+1});
11:     if steps > 100 then
12:       Compute and update the network parameters θ;
13:     end if
14:     Update state: s_t ← s_{t+1};
15:     Update the target network every C steps;
16:     if done then
17:       break;
18:     end if
19:   end for
20: end for
21: Return neural network parameters θ;
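The environment that Algorithm 1 interacts with is not given in code in the paper; the sketch below is an illustrative single-sub-channel version following Section 3.2: the state is the vector of user SINRs, an action raises one user's power factor by ξ while lowering the others by ξ/(N−1), and the reward follows Eq. (11) with the penalty β applied when an action would violate constraint (10d). The class name and defaults are assumptions (the defaults loosely follow Table 1), and the SIC ordering constraint $\alpha_i < \alpha_{i+1}$ is not enforced for brevity.

```python
import numpy as np

class NomaPowerEnv:
    """Toy downlink NOMA power-allocation environment (single sub-channel)."""

    def __init__(self, h, P_total=3.0, B=1e6, M=1, sigma2=1e-3, xi=1e-4, beta=-3):
        self.h = np.asarray(h, dtype=float)           # channel coefficients, strong to weak
        self.P_total, self.B, self.M, self.sigma2 = P_total, B, M, sigma2
        self.xi, self.beta, self.N = xi, beta, len(h)
        self.alpha = np.full(self.N, 1.0 / self.N)    # start from an equal power split

    def state(self):
        """State: the SINR of every user, following Eq. (3)."""
        gains = np.abs(self.h) ** 2
        return np.array([gains[i] * self.alpha[i] * self.P_total /
                         (gains[i] * self.alpha[:i].sum() * self.P_total + self.sigma2)
                         for i in range(self.N)])

    def step(self, user):
        """Action: raise one user's factor by xi, lower the rest by xi/(N-1)."""
        new_alpha = self.alpha - self.xi / (self.N - 1)
        new_alpha[user] = self.alpha[user] + self.xi          # sum of factors stays 1
        penalty = self.beta if np.any(new_alpha <= 0) else 0.0
        if penalty == 0.0:                                    # only apply feasible actions
            self.alpha = new_alpha
        rate = (self.B / self.M) * np.log2(1.0 + self.state())
        delta = rate.max() - rate.min()                       # rate spread; with one sub-channel,
        reward = -2.0 * delta + penalty                       # Delta_sys and Delta_channel coincide
        return self.state(), reward
```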
As can be seen from the algorithm, the parameters are initialized first, including the learning rate of the neural network, the discount factor, the batch size, the state space, and the amount of action changes. Initially, a randomly assigned SINR is obtained from the environment as an input state for the first round. An iteration is started by selecting an action based on this state, and a reward for the action is obtained from the environment. The training process of the DDQN is as follows:
  • Initializing the value network including the number of layers, the type of layers, the activation functions, and the weights and biases of the network. Initialize the target network in the same way. Initializing the experience replay pool involves defining the capacity of the replay buffer, which determines how many experiences can be stored, as well as specifying the number of experiences to be sampled from the buffer at one time.
  • The agent begins by observing the current state of the environment and analyzes information about the current environment. Using this information, the agent then leverages its neural network to predict the Q value of each possible action. By evaluating these Q values, the agent can determine which action will likely maximize its rewards, guiding its decision-making process.
  • The agent chooses to execute the action with the highest Q value. Usually, an ε -greedy strategy is used in this process. In this approach, the agent randomly selects actions with a certain probability to ensure it explores the environment effectively. Specifically, with probability ε , the agent chooses a random action rather than the action that currently seems best according to its Q-value estimates. This random selection encourages the agent to try out different actions and discover potentially better strategies that it might not have considered if it always followed the action with the highest estimated Q-value.
  • The agent performs the selected action and observes the next state. It stores the current state, the selected action, the reward, and the next state in a memory replay pool and randomly selects a batch of experiences for updating the network.
  • The parameters of the value network are updated. The value network is periodically copied to the target network to maintain the stability of the target. This copying helps to stabilize the training process, reducing the oscillations and divergence that can occur when both the value and target networks are updated simultaneously. By periodically syncing the target network with the value network, the agent can more effectively learn from its experiences and improve its policy over time.
  • Repeat the above process until the algorithm converges.
The training process of the DDQN is an iterative process that continuously interacts with the environment and updates network parameters. The agent gradually learns the dynamic changes in the environment and optimizes its action strategy. Eventually, the trained value network can be used to guide the agent to make optimal action choices in a new state.
Figure 2 illustrates the specific coding framework, including the setup of the NOMA communication environment and the operational logic of the DDQN algorithm. The algorithm flowchart above mainly introduces how to use DDQN to optimize the NOMA power allocation process, but it does not mention how to construct the NOMA communication environment model. The lower part of Figure 2 provides a detailed demonstration of how to obtain the state from the NOMA communication environment and indicates which part of the DDQN algorithm this state needs to be transmitted to. Additionally, it shows how the NOMA communication environment selects appropriate power allocation factors as actions based on the state, and finally, how the rewards are given.
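The following sketch ties these pieces together in the spirit of Figure 2 and the training steps listed above, combining the illustrative `NomaPowerEnv`, `QNet`, and `ddqn_update` defined earlier with ε-greedy exploration, an experience replay buffer, and periodic target-network synchronization. Episode lengths, the buffer size, and the discount factor are placeholder assumptions; the remaining defaults loosely follow Table 1.

```python
import random
from collections import deque

import numpy as np
import torch

def train(env, n_actions, episodes=100, steps=2000, gamma=0.9, eps=0.1,
          batch_size=64, sync_every=100, lr=0.01):
    """Illustrative DDQN training loop for the NOMA power-allocation environment."""
    value_net, target_net = QNet(env.N, n_actions), QNet(env.N, n_actions)
    target_net.load_state_dict(value_net.state_dict())       # start with identical networks
    optimizer = torch.optim.Adam(value_net.parameters(), lr=lr)
    replay, total_steps = deque(maxlen=10_000), 0

    for _ in range(episodes):
        state = env.state()                                   # initial state: the users' SINR
        for _ in range(steps):
            if random.random() < eps:                         # explore with probability eps
                action = random.randrange(n_actions)
            else:                                             # otherwise act greedily
                with torch.no_grad():
                    q = value_net(torch.tensor(state, dtype=torch.float32))
                    action = int(q.argmax())
            next_state, reward = env.step(action)
            replay.append((state, action, reward, next_state))
            if len(replay) >= batch_size:                     # sample a minibatch and update
                s, a, r, s2 = zip(*random.sample(replay, batch_size))
                batch = (torch.tensor(np.array(s), dtype=torch.float32),
                         torch.tensor(a, dtype=torch.long),
                         torch.tensor(r, dtype=torch.float32),
                         torch.tensor(np.array(s2), dtype=torch.float32))
                ddqn_update(value_net, target_net, optimizer, batch, gamma)
            total_steps += 1
            if total_steps % sync_every == 0:                 # periodic target-network sync
                target_net.load_state_dict(value_net.state_dict())
            state = next_state
    return value_net

# Example usage (two users, one "raise this user's factor" action per user):
# env = NomaPowerEnv(h=[1.0, 0.3])
# policy = train(env, n_actions=env.N)
```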

4. Simulation Results

In this paper, simulations have been performed using the previously mentioned algorithms. The parameters of the simulation section are shown in Table 1:
The reward function in this paper is designed to be negative; as the reward converges toward zero, the algorithm converges and the allocation improves.
Deep reinforcement learning (DRL) algorithms use neural networks and require a loss function for updating. A decrease in the loss function indicates that the Q-values output by the neural network are becoming more accurate, i.e., the network is becoming more reliable. The rate of convergence of the loss function also correlates with the time the algorithm needs to reach the objective. Figure 3 shows the loss function and reward function for the 6-user downlink NOMA communication scenario. The DDQN algorithm selected in this paper has lower complexity than other DRL algorithms, and the loss function converges quickly. After about 1800 cycles, the agent's first episode converges, exits, and moves on to the second episode. Good rewards are already obtained in the first few episodes, and the agent can make accurate judgments when the total number of cycles reaches about 5000. Because of the greedy factor, the agent still fluctuates in later cycles, for example at about 13,000 and 20,000 cycles; at these two moments, both the loss function and the reward function exhibit corresponding fluctuations. This indicates that the algorithm in this paper reduces complexity, requires less execution time to reach the goal, and is therefore more efficient.
Since the multi-user scenario is difficult to observe directly, this paper simplifies the problem to two users in the same scenario, as shown in Figure 4, which plots the user SINR against the number of training iterations for the 2-user NOMA downlink scenario. From Figure 4, it can be seen that the SINR of both users in the sub-channel trends upward after training starts. User 1 has poor channel conditions and experiences a significant improvement. During the first 1200 iterations, the rates of both users in the channel improve. At 1200 iterations, the SINR of the poor user and the strong user are close to each other, and their channel rates are consequently maximized. After 1200 iterations, however, the users' SINR experiences a short-term decline. This occurs because the algorithm converges around the 1200th iteration of the first episode and begins the second episode. At the start of each episode, the exploration factor ε is relatively high and then gradually decreases as the episode progresses; at the beginning of a new episode there is therefore a significant likelihood of the agent adopting a random selection strategy, which can cause a temporary downward trend in performance. This situation is only temporary and does not affect the overall performance of the algorithm. Firstly, the goal of this paper is to guarantee the QoS of each user within the communication system; secondly, the sum rate of the whole system should be increased as much as possible while guaranteeing that QoS. As the power allocation of the users gradually becomes more reasonable, the performance of these two users improves, as does the overall performance of the sub-channel.
From Figure 5, it can be seen that the rate of the DDQN-based power allocation algorithm is higher than that of Q-table-based power allocation, and the rates of individual users follow the same trend. Initially, the weak users of both algorithms start improving at the same time, but due to table capacity limitations, the Q-table algorithm improves slowly and does not achieve better results. The figure illustrates the drawback of the table-based approach: the limited table capacity prevents the Q-table algorithm from exploring the entire state space, which in turn prevents it from achieving better performance.
Figure 6 shows the relationship between the users' SINR and the number of iterations in a 6-user downlink NOMA communication scenario. The goal of this paper is to ensure the QoS of the communicating users, so the overall rate of the system may not be maximal. As can be seen from the figure, the final rates of users with better channel conditions are reduced after multiple iterations. For example, User 1 and User 4 have better initial channel conditions, and after 10,000 iterations their SINR starts to decrease; they sacrifice part of their performance for the performance improvement of the poor users. Because User 4's larger drop pushes it below the system's average achievable rate, at 15,000 iterations the algorithm takes it as the optimization target and adjusts its power allocation coefficient to improve its SINR. The magnitude of the reduction for strong users is nevertheless very small, whereas for users with poor channel conditions the enhancement is very large. The two poor users, User 1 and User 5, are the initial main optimization targets, so over the first 15,000 iterations their SINR improves dramatically. The algorithm converges after 30,000 iterations, and we can see that eventually all users' SINRs are close to each other. When the code runs, it uses the principle of symmetry to balance resource allocation between weak and strong users: it increases the SINR of weak users while reducing the SINR of strong users, thus ensuring the QoS of the communication system. This approach enables fair and efficient resource allocation across the entire system. As a result, all users obtain a reasonable power allocation coefficient, and the algorithm optimizes the performance of poor users while ensuring fairness in communications. In natural disasters, emergencies, mountainous regions, and other special areas, it ensures that each user can maintain basic communication without being completely out of contact.
Figure 7 compares the channel capacity of a conventional OMA system, the conventional NOMA power allocation algorithm, and the DDQN-based NOMA system at different power levels. As the transmit power of the base station increases, the channel rate of the users rises. When the power is below 10 W, the user channel rate rises significantly for all algorithms; when the power exceeds 15 W, the rates rise more slowly. At all power levels, the NOMA system achieves better channel rates than the conventional OMA system and can obtain higher channel capacity. After allocating user power with the proposed DDQN algorithm, users obtain better channel capacity even at low transmit power, and higher channel capacity is achieved at every transmit power level compared with conventional NOMA power allocation algorithms. This greatly improves the communication efficiency of the system.
Figure 8 compares the DDQN-based downlink NOMA power allocation algorithm, the random-allocation-based NOMA algorithm, and the OMA algorithm. The system capacity of the DDQN-based algorithm is higher than that of the two traditional algorithms in terms of user sum rate. At a signal-to-noise ratio of 20 dB, power allocation with the DDQN algorithm achieves nearly three times the performance of the traditional algorithms. Furthermore, the system communication performance under the DDQN algorithm is more stable, without significant fluctuations, whereas the traditional OMA and NOMA power allocation algorithms exhibit greater volatility at low signal-to-noise ratios and yield low channel capacity. The DDQN-based algorithm achieves high channel capacity even at signal-to-noise ratios below 10 dB, ensuring good communication performance. It can be concluded that the DDQN-based algorithm significantly outperforms the conventional NOMA random power allocation and OMA power allocation methods.
Figure 9 shows the convergence of the algorithm under different learning rates. An inappropriate learning rate can lead to an increase in the number of iterations and slower convergence. As shown in the figure, when the learning rate is set to 0.01, the algorithm converges rapidly. However, when the learning rates are set to 0.001 and 0.0001, the convergence time significantly increases. This can invalidate the assumption of constant channel conditions in the short term, thereby affecting the algorithm’s accuracy. Therefore, the learning rate in this paper is set to 0.01, as it has been shown to achieve higher learning efficiency and more accurate output from the value network.

5. Conclusions

In this paper, a DDQN-based algorithm is used to allocate power to users in NOMA while guaranteeing a minimum user QoS. The problem is divided into two parts. First, users in the communication system are grouped by a modified traditional method: the grouping algorithm takes both gain difference and similarity into account. Power allocation is then performed with the DDQN algorithm. Simulation results show that convergence is achieved in approximately 1800 iterations, so the objective function is reached quickly and the conclusions are reliable. The algorithm eventually brings the SINRs of all users in the cell close to each other, with lower complexity than traditional power allocation algorithms. The method in this paper achieves a higher SINR and higher user rates and remains stable as the number of users increases. At a signal-to-noise ratio of 20 dB, the channel capacity obtained with the DDQN algorithm is enhanced by a factor of 2 compared with traditional NOMA and by a factor of 10 compared with OMA, which improves the overall performance of the communication system. However, the proposed algorithm still faces some limitations in practical applications. For example, the system sum rate cannot be optimal because of the chosen optimization objective; devices deployed on small UAVs may lack sufficient computational resources; and dynamic planning and real-time adjustment strategies impose significant power-consumption requirements. Future research can therefore focus on improving the efficiency of the algorithm and reducing its computational cost. In addition, if the algorithm is applied to emergency sites or other special situations, it could provide basic communication services to users.

Author Contributions

Conceptualization, X.G. and Y.L.; methodology, X.G. and Y.L.; formal analysis, Y.X. and Y.L.; investigation, H.L. and X.W.; resources, X.G. and Y.X.; writing—original draft preparation, X.G. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under Grants 61841107 and 62061024, Gansu Natural Science Foundation under Grants 22JR5RA274 and 23YFGA0062, Gansu Innovation Foundation 2022A-215.

Data Availability Statement

The datasets presented in this article are not readily available because the data mentioned in this manuscript are private to laboratories and institutions.

Acknowledgments

The authors thank the editor, associate editor, and reviewer for their helpful suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

NOMA: Non-Orthogonal Multiple Access
OMA: Orthogonal Multiple Access
DDQN: Double Deep Q Network
DRL: Deep Reinforcement Learning
QoS: Quality of Service
SINR: Signal to Interference Noise Ratio
5G: The Fifth Generation
IoT: Internet of Things
SIC: Successive Interference Cancellation
RL: Reinforcement Learning
ML: Machine Learning
BS: Base Station
QoE: Quality of Experience
DQN: Deep Q Network

References

  1. Sufyan, A.; Khan, K.B.; Khashan, O.A.; Mir, T.; Mir, U. From 5G to beyond 5G: A comprehensive survey of wireless network evolution, challenges, and promising technologies. Electronics 2023, 12, 2200. [Google Scholar] [CrossRef]
  2. Sarker, I.H.; Khan, A.I.; Abushark, Y.B.; Alsolami, F. Internet of things (iot) security intelligence: A comprehensive overview, machine learning solutions and research directions. Mob. Netw. Appl. 2023, 28, 296–312. [Google Scholar] [CrossRef]
  3. Pham, Q.V.; Nguyen, H.T.; Han, Z.; Hwang, W.J. Coalitional games for computation offloading in NOMA-enabled multi-access edge computing. IEEE Trans. Veh. Technol. 2019, 69, 1982–1993. [Google Scholar] [CrossRef]
  4. Gures, E.; Shayea, I.; Ergen, M.; Azmi, M.H.; El-Saleh, A.A. Machine learning-based load balancing algorithms in future heterogeneous networks: A survey. IEEE Access 2022, 10, 37689–37717. [Google Scholar] [CrossRef]
  5. Al Homssi, B.; Dakic, K.; Wang, K.; Alpcan, T.; Allen, B.; Boyce, R.; Kandeepan, S.; Al-Hourani, A.; Saad, W. Artificial intelligence techniques for next-generation massive satellite networks. IEEE Commun. Mag. 2023, 62, 66–72. [Google Scholar] [CrossRef]
  6. Kong, L.; Tan, J.; Huang, J.; Chen, G.; Wang, S.; Jin, X.; Zeng, P.; Khan, M.; Das, S.K. Edge-computing-driven internet of things: A survey. ACM Comput. Surv. 2022, 55, 1–41. [Google Scholar] [CrossRef]
  7. Keti, F.; Atroshey, S.M.; Hamadamin, J.A. A Review of New Improvements in Resource Allocation Problem Optimization In 5G Using Non-Orthogonal Multiple Access. Acad. J. Nawroz Univ. 2022, 11, 245–254. [Google Scholar] [CrossRef]
  8. Islam, S.R.; Zeng, M.; Dobre, O.A.; Kwak, K.S. Resource allocation for downlink NOMA systems: Key techniques and open issues. IEEE Wirel. Commun. 2018, 25, 40–47. [Google Scholar] [CrossRef]
  9. Paulus, A.; Rolínek, M.; Musil, V.; Amos, B.; Martius, G. Comboptnet: Fit the right np-hard problem by learning integer programming constraints. In Proceedings of the International Conference on Machine Learning, Shenzhen, China, 26 February–1 March 2021; pp. 8443–8453. [Google Scholar]
  10. Hiesmayr, B.C. Free versus bound entanglement, a NP-hard problem tackled by machine learning. Sci. Rep. 2021, 11, 19739. [Google Scholar] [CrossRef]
  11. Zhang, X.; Gao, Q.; Gong, C.; Xu, Z. User grouping and power allocation for NOMA visible light communication multi-cell networks. IEEE Commun. Lett. 2016, 21, 777–780. [Google Scholar] [CrossRef]
  12. Cui, J.; Liu, Y.; Nallanathan, A. Multi-agent reinforcement learning-based resource allocation for UAV networks. IEEE Trans. Wirel. Commun. 2019, 19, 729–743. [Google Scholar] [CrossRef]
  13. Chen, S.; Peng, K.; Jin, H. A suboptimal scheme for uplink NOMA in 5G systems. In Proceedings of the 2015 International Wireless Communications and Mobile Computing Conference (IWCMC), Dubrovnik, Croatia, 24–28 August 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1429–1434. [Google Scholar]
  14. Lei, L.; Yuan, D.; Ho, C.K.; Sun, S. Power and channel allocation for non-orthogonal multiple access in 5G systems: Tractability and computation. IEEE Trans. Wirel. Commun. 2016, 15, 8580–8594. [Google Scholar] [CrossRef]
  15. Zeng, M.; Yadav, A.; Dobre, O.A.; Poor, H.V. Energy-efficient power allocation for MIMO-NOMA with multiple users in a cluster. IEEE Access 2018, 6, 5170–5181. [Google Scholar] [CrossRef]
  16. Wang, D.; Chen, D.; Song, B.; Guizani, N.; Yu, X.; Du, X. From IoT to 5G I-IoT: The next generation IoT-based intelligent algorithms and 5G technologies. IEEE Commun. Mag. 2018, 56, 114–120. [Google Scholar] [CrossRef]
  17. Zhai, Q.; Bolić, M.; Li, Y.; Cheng, W.; Liu, C. A Q-learning-based resource allocation for downlink non-orthogonal multiple access systems considering QoS. IEEE Access 2021, 9, 72702–72711. [Google Scholar] [CrossRef]
  18. Palitharathna, K.W.; Suraweera, H.A.; Godaliyadda, R.I.; Herath, V.R.; Ding, Z. Neural network-based blockage prediction and optimization in lightwave power transfer-enabled hybrid VLC/RF systems. IEEE Internet Things J. 2023, 11, 5237–5248. [Google Scholar] [CrossRef]
  19. Ghanbarzadeh, V.; Zahabi, M.; Amiriara, H.; Jafari, F.; Kaddoum, G. Resource allocation in NOMA Networks: Convex optimization and stacking ensemble machine learning. IEEE Open J. Commun. Soc. 2024, 5, 5276–5288. [Google Scholar] [CrossRef]
  20. He, S.; Wang, W. QoE-aware Q-learning resource allocation for NOMA wireless multimedia communications. IET Netw. 2020, 9, 262–269. [Google Scholar] [CrossRef]
  21. Guo, W.; Wang, X. Power Allocation for Secure NOMA Network Based on Q-learning. IEEE Access 2024, 12, 104833–104845. [Google Scholar] [CrossRef]
  22. Wang, X.; Meng, K.; Wang, X.; Liu, Z.; Ma, Y. Dynamic user resource allocation for downlink multicarrier NOMA with an actor–critic method. Energies 2023, 16, 2984. [Google Scholar] [CrossRef]
  23. Siddiqi, U.F.; Sait, S.M.; Uysal, M. Deep Q-learning based optimization of VLC systems with dynamic time-division multiplexing. IEEE Access 2020, 8, 120375–120387. [Google Scholar] [CrossRef]
  24. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
Figure 1. Schematic diagram of the system model.
Figure 2. DDQN-based NOMA power allocation algorithm.
Figure 3. Relationship between the reward function, the loss function, and the number of iterations.
Figure 4. Relationship between the number of iterations and a user's SINR in the downlink two-user NOMA scenario.
Figure 5. Comparison of the DDQN algorithm with the Q-table algorithm.
Figure 6. Relationship between the number of iterations and SINR in the NOMA system with six users.
Figure 7. Variation of channel rate with power for the NOMA and OMA systems.
Figure 8. Channel capacity comparison of different algorithms.
Figure 9. Convergence of the algorithm at different learning rates.
Table 1. Simulation parameters.
Parameter | Meaning | Value
N | number of users | 2, 6
M | number of channels | 1, 3
P_total | total power | 3 W
Episode | number of rounds | 100
β | punishing factor | −3
ς | quantification interval | 0.0001
ε | greedy factor | 0.1
α | learning rate | 0.01
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
