D2D Mobile Relaying Meets NOMA—Part II: A Reinforcement Learning Perspective
Figure 1. Cellular offloading using device-to-device (D2D) cooperative relaying.
Figure 2. A decentralized reinforcement learning process.
Figure 3. Convergence to a pure Nash equilibrium using the linear reward-inaction algorithm.
Figure 4. Convergence to a pure Nash equilibrium using LRI, for different SINR threshold values (γ_th).
Figure 5. Convergence to a pure Nash equilibrium using LRI, for different transmit powers of device 2.
Figure 6. Convergence to a pure Nash equilibrium using LRI, for different cooperation levels (x_i).
Figure 7. The effect of channel fading on the convergence toward a pure Nash equilibrium using LRI.
Figure 8. The effect of channel fading on the convergence to a mixed NE using BG dynamics.
Figure 9. The effect of the SINR threshold (γ_th) on the convergence to a mixed NE using BG dynamics.
Figure 10. The effect of the cooperation level (x_i) on the convergence to a mixed NE using BG dynamics.
Figure 11. Long-term relaying probability under different transmit power schemes.
Figure 12. Long-term relaying probability under fast fading and different SINR threshold values.
Figure 13. Long-term relaying probability under fast fading and different cooperation levels (x_i).
Abstract
1. Introduction
1.1. Motivations and New Trends
1.2. Our Contributions
- Part I's contributions are related to the performance analysis of a self-organizing D2D relaying scheme:
  1. We consider a hybrid two-tier scheme where cellular links use non-orthogonal multiple access (NOMA), whilst D2D links use orthogonal multiple access (OMA). This scheme is suitable for both inband and outband D2D.
  2. We fully characterize the Rayleigh channel model, derive closed forms for the outage probabilities of both OMA and NOMA links, and then compute the average throughput perceived by each device in the network (a standard single-link example is sketched right after this list).
  3. To the best of our knowledge, this work is the first to propose a biform game to capture the devices' behavior when deciding which radio access network (RAN) to connect to (i.e., either cellular or D2D).
- Part II's contributions are related to implementing self-organized mode selection using RL. Each device has to strategically learn when to select the cellular mode and act as a relay, and when to access the network via the D2D mode:
  4. We propose to empower devices with a self-organization capability, allowing them to reach pure Nash equilibria (linear reward-inaction) and mixed Nash equilibria (Boltzmann–Gibbs dynamics) in a fully distributed manner;
  5. We perform extensive simulations to analyze the effects of different parameters on the learning schemes. Insights on accuracy and convergence are also provided.
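For intuition on the closed forms mentioned in item 2, recall the textbook outage probability of a single Rayleigh-faded link, where the SINR γ_i is exponentially distributed with mean γ̄_i. This is an illustrative special case only; the exact OMA and NOMA expressions accounting for interference are derived in Part I:

P_out,i = Pr{γ_i < γ_th} = 1 − exp(−γ_th / γ̄_i).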
2. Related Work
3. The System
3.1. Channel Model
3.2. Average Throughput
4. Decentralized Learning Algorithms
4.1. Learning Pure Nash Equilibrium (PNE): The Linear Reward-Inaction Algorithm (LRI)
The quantities involved in the update rule of Equation (8) are:
- the probability of relaying for device i at time t;
- the learning rate (step size) at instant t;
- the environmental response, defined as the ratio of device i's instantaneous throughput to its transmission rate;
- the indicator function, which equals 1 when the action chosen by device i at time t matches the given action, and 0 otherwise.
- At every instant, each player chooses an action based on its action probability vector; thus, player i chooses an action at instant t according to its current probability distribution;
- Each player obtains a reward that depends on its own action and on the actions of all other devices;
- Each player updates its action probability vector according to the rule in Equation (8).
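For reference, the standard linear reward-inaction update of the relaying probability takes the form below. This is a generic form consistent with the quantities listed above, written with illustrative notation p_i(t) (relaying probability), b(t) (step size), and r_i(t) (normalized reward); the paper's Equation (8) may differ in its exact normalization:

p_i(t+1) = p_i(t) + b(t) · r_i(t) · [ 1{a_i(t) = relay} − p_i(t) ],

so the probability of relaying drifts toward the action actually played, at a speed proportional to the reward it earned.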
Algorithm 1: Linear reward-inaction algorithm (LRI).
Parameter initialization: each device randomly picks an initial probability of relaying.
In parallel, at each device i: repeat the action-selection, reward-observation, and probability-update steps above until the stopping criterion is met.
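A minimal simulation sketch of the LRI loop for the two-action relay/D2D choice is given below. The payoff stub normalized_throughput, the parameter values b, n_iter, and tol, and the stopping rule are illustrative assumptions rather than the paper's exact network model.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_throughput(actions, i):
    """Placeholder environment response in [0, 1]: the ratio of device i's
    instantaneous throughput to its transmission rate (stubbed here)."""
    return rng.uniform(0.0, 1.0)

def lri(n_devices=4, b=0.05, n_iter=5000, tol=1e-3):
    # p[i] = probability that device i chooses to relay (cellular mode)
    p = rng.uniform(0.2, 0.8, size=n_devices)
    for t in range(n_iter):
        # Each device samples an action from its own mixed strategy
        actions = rng.uniform(size=n_devices) < p          # True = relay
        for i in range(n_devices):
            r = normalized_throughput(actions, i)          # reward in [0, 1]
            # Linear reward-inaction: move probability mass toward the
            # action actually played, proportionally to the reward
            target = 1.0 if actions[i] else 0.0
            p[i] += b * r * (target - p[i])
        if np.all(np.minimum(p, 1 - p) < tol):             # near-pure strategies
            break
    return p

print(lri())
```

With a real payoff function in place of the stub, the probabilities are driven toward 0 or 1, which is the pure-strategy convergence behavior illustrated in Figures 3 to 8.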
4.2. Learning Mixed Nash Equilibrium (MNE): Boltzmann–Gibbs-Based Payoff-RL
Algorithm 2: Boltzmann–Gibbs-based payoff-RL.
Parameter initialization: each device randomly picks an initial probability of relaying.
In parallel, at each device i: repeat the action-selection, payoff-estimation, and Boltzmann–Gibbs probability-update steps until the stopping criterion is met.
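The sketch below illustrates a typical Boltzmann–Gibbs payoff-RL loop: each device keeps a running estimate of the payoff of each action and maps those estimates to a mixed strategy through a softmax with temperature tau. The payoff stub, the estimation step nu, and the temperature value are standard choices assumed here for illustration; the paper's Algorithm 2 may differ in its exact update.

```python
import numpy as np

rng = np.random.default_rng(1)

def payoff(actions, i, action):
    """Placeholder payoff of device i when playing `action` (0 = D2D, 1 = relay)."""
    return rng.uniform(0.0, 1.0)

def boltzmann_gibbs_rl(n_devices=4, tau=0.1, nu=0.05, n_iter=5000):
    # u_hat[i, a] = device i's running estimate of the payoff of action a
    u_hat = np.zeros((n_devices, 2))
    p_relay = np.full(n_devices, 0.5)          # initial relaying probabilities
    for t in range(n_iter):
        actions = (rng.uniform(size=n_devices) < p_relay).astype(int)
        for i in range(n_devices):
            a = actions[i]
            # Update the payoff estimate of the action actually played
            u_hat[i, a] += nu * (payoff(actions, i, a) - u_hat[i, a])
            # Boltzmann-Gibbs (softmax) map from payoff estimates to strategy
            w = np.exp(u_hat[i] / tau)
            p_relay[i] = w[1] / w.sum()
    return p_relay

print(boltzmann_gibbs_rl())
```

Because the softmax keeps every action played with positive probability, the scheme settles on interior (mixed) relaying probabilities rather than pure strategies, which is the behavior shown in Figures 9 to 13.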
5. Some Insights on RL Convergence, Fairness and Scalability
5.1. Convergence
5.2. Fairness
5.3. Scalability
6. Performance Analysis
6.1. Reaching a Pure Nash Equilibrium Using the Linear Reward-Inaction Algorithm
6.2. Reaching Mixed Equilibrium Using Boltzmann–Gibbs (BG) Algorithm
6.3. Long-Term Behavior
7. Conclusions and Perspectives
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
- Nawaz, S.J.; Sharma, S.K.; Wyne, S.; Patwary, M.; Asaduzzaman, M. Quantum Machine Learning for 6G Communication Networks: State-of-the-Art and Vision for the Future. IEEE Access 2019, 7, 46317–46350.
- Alsharif, M.H.; Kelechi, A.H.; Albreem, M.A.; Chaudhry, S.A.; Zia, M.S.; Kim, S. Sixth Generation (6G) Wireless Networks: Vision, Research Activities, Challenges and Potential Solutions. Symmetry 2020, 12, 676.
- Ali, S.; Saad, W.; Rajatheva, N.; Chang, K.; Steinbach, D.; Sliwa, B.; Wietfeld, C.; Mei, K.; Shiri, H.; Zepernick, H.J.; et al. 6G White Paper on Machine Learning in Wireless Communication Networks. arXiv 2020, arXiv:2004.13875.
- Dang, S.; Amin, O.; Shihada, B.; Alouini, M.S. What should 6G be? arXiv 2019, arXiv:1906.00741.
- Aazhang, B.; Ahokangas, P.; Alves, H.; Alouini, M.S.; Beek, J.; Benn, H.; Bennis, M.; Belfiore, J.; Strinati, E.; Chen, F.; et al. Key Drivers and Research Challenges for 6G Ubiquitous Wireless Intelligence (White Paper); University of Oulu: Oulu, Finland, 2019.
- Saad, W.; Bennis, M.; Chen, M. A Vision of 6G Wireless Systems: Applications, Trends, Technologies, and Open Research Problems. IEEE Netw. 2020, 34, 134–142.
- Tariq, F.; Khandaker, M.R.A.; Wong, K.K.; Imran, M.A.; Bennis, M.; Debbah, M. A Speculative Study on 6G. IEEE Wirel. Commun. 2020, 27, 118–125.
- Piran, M.; Suh, D. Learning-Driven Wireless Communications, towards 6G. arXiv 2019, arXiv:1908.07335.
- Janbi, N.; Katib, I.; Albeshri, A.; Mehmood, R. Distributed Artificial Intelligence-as-a-Service (DAIaaS) for Smarter IoE and 6G Environments. Sensors 2020, 20, 5796.
- Xiao, Y.; Shi, G.; Krunz, M. Towards Ubiquitous AI in 6G with Federated Learning. arXiv 2020, arXiv:2004.13563.
- Khan, L.U.; Pandey, S.R.; Tran, N.H.; Saad, W.; Han, Z.; Nguyen, M.N.H.; Hong, C.S. Federated Learning for Edge Networks: Resource Optimization and Incentive Mechanism. IEEE Commun. Mag. 2020, 58, 88–93.
- Driouech, S.; Sabir, E.; Ghogho, M.; Amhoud, E.M. D2D Mobile Relaying Meets NOMA—Part I: A Biform Game Analysis. Sensors 2021, 21, 702.
- Tembine, H. Distributed Strategic Learning for Wireless Engineers; CRC Press: Boca Raton, FL, USA, 2012.
- Shahid, A.; Aslam, S.; Lee, K.G. A decentralized heuristic approach towards resource allocation in femtocell networks. Entropy 2013, 15, 2524–2547.
- Lu, X.; Schwartz, H.M. Decentralized learning in two-player zero-sum games: A L-RI lagging anchor algorithm. In Proceedings of the 2011 American Control Conference, San Francisco, CA, USA, 29 June–1 July 2011; pp. 107–112.
- Elhammouti, H.; Sabir, E.; Benjillali, M.; Echabbi, L.; Tembine, H. Self-Organized Connected Objects: Rethinking QoS Provisioning for IoT Services. IEEE Commun. Mag. 2017, 55, 41–47.
- Attaoui, W.; Sabir, E. Combined Beam Alignment and Power Allocation for NOMA-Empowered mmWave Communications; Ubiquitous Networking; Habachi, O., Meghdadi, V., Sabir, E., Cances, J.P., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 82–95.
- Al-Tous, H.; Barhumi, I. Distributed reinforcement learning algorithm for energy harvesting sensor networks. In Proceedings of the 2019 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Sochi, Russia, 3–6 June 2019; pp. 1–3.
- Park, H.; Lim, Y. Reinforcement Learning for Energy Optimization with 5G Communications in Vehicular Social Networks. Sensors 2020, 20, 2361.
- Khan, M.; Alam, M.; Moullec, Y.; Yaacoub, E. Throughput-Aware Cooperative Reinforcement Learning for Adaptive Resource Allocation in Device-to-Device Communication. Future Internet 2017, 9, 72.
- Zia, K.; Javed, N.; Sial, M.N.; Ahmed, S.; Pirzada, A.A.; Pervez, F. A distributed multi-agent RL-based autonomous spectrum allocation scheme in D2D enabled multi-tier HetNets. IEEE Access 2019, 7, 6733–6745.
- Li, Z.; Guo, C.; Xuan, Y. A multi-agent deep reinforcement learning based spectrum allocation framework for D2D communications. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Big Island, HI, USA, 9–13 December 2019; pp. 1–6.
- Handouf, S.; Sabir, E.; Sadik, M. A pricing-based spectrum leasing framework with adaptive distributed learning for cognitive radio networks. In International Symposium on Ubiquitous Networking; Springer: Berlin, Germany, 2015; pp. 39–51.
- Tathe, P.K.; Sharma, M. Dynamic actor-critic: Reinforcement learning based radio resource scheduling for LTE-advanced. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; pp. 1–4.
- Sun, P.; Li, J.; Lan, J.; Hu, Y.; Lu, X. RNN Deep Reinforcement Learning for Routing Optimization. In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; pp. 285–289.
- Khodayari, S.; Yazdanpanah, M.J. Network routing based on reinforcement learning in dynamically changing networks. In Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'05), Hong Kong, China, 14–16 November 2005; p. 5.
- Islam, S.R.; Avazov, N.; Dobre, O.A.; Kwak, K.S. Power-domain non-orthogonal multiple access (NOMA) in 5G systems: Potentials and challenges. IEEE Commun. Surv. Tutor. 2016, 19, 721–742.
- Ding, Z.; Lei, X.; Karagiannidis, G.K.; Schober, R.; Yuan, J.; Bhargava, V.K. A survey on non-orthogonal multiple access for 5G networks: Research challenges and future trends. IEEE J. Sel. Areas Commun. 2017, 35, 2181–2195.
- Tabassum, H.; Ali, M.S.; Hossain, E.; Hossain, M.; Kim, D.I. Non-orthogonal multiple access (NOMA) in cellular uplink and downlink: Challenges and enabling techniques. arXiv 2016, arXiv:1608.05783.
Symbol | Meaning
---|---
n | Number of active devices in the cell
 | Transmission power vector of device i
 | Transmission power of device i while transmitting to the BS
 | Transmission power of device i while transmitting over the D2D link
 | Distance between device i and the BS
 | Channel gain of device i
 | SINR of device i
γ_th | SINR threshold
 | Outage probability of device i
 | Mean of the channel gain
R | Transmission rate
 | Orthogonality factor between the carriers allocated to active devices j and k
 | Path-loss exponents in cellular and D2D, respectively
x_i | Fraction of throughput relay i gives to D2D devices
 | Utility of device i at time t when choosing a given action
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).