Multi-Agent Reinforcement Learning: A Review of Challenges and Applications
Figure 1. (a) In the single-agent RL paradigm, an agent interacts with an environment by performing an action for which it receives a reward. (b) In the MARL paradigm, from the agent's point of view, the other agents may be considered part of the environment, which changes as a result of the actions of all the agents.
Figure 2. The structure of an actor–critic algorithm.
Figure 3. The taxonomy of reinforcement learning algorithms.
Figure 4. Challenges encountered by each algorithm.
Figure 5. Traffic junction environment.
Abstract
1. Introduction
2. Background
2.1. Multi-Agent Framework
2.1.1. Markov Decision Process
- S is the state space;
- A is the action space;
- P(s′ | s, a) is the transition probability from state s to s′ given the action a;
- R(s, a, s′) is the reward function, whose value is the reward received by the agent for the transition from the state–action pair (s, a) to the state s′;
- γ ∈ [0, 1] is the discount factor, a parameter used to balance the weight of instantaneous and future rewards (a minimal sketch of such a tuple is given after this list).
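To make the tuple concrete, here is a minimal sketch of a toy two-state MDP in Python; the specific states, transition probabilities and rewards are hypothetical, chosen only to illustrate the structure (S, A, P, R, γ).

```python
import numpy as np

# Hypothetical two-state, two-action MDP (S, A, P, R, gamma).
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = probability of reaching s' from s after taking action a.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

# R[s, a, s'] = reward received for the transition (s, a) -> s'.
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 5.0]]])

def sample_transition(s, a, rng=np.random.default_rng()):
    """Sample the next state and the reward for a state-action pair."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a, s_next]
```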
2.1.2. Markov Game
- N = {1, …, n} is the set of agents;
- S is the state space observed by all agents;
- A_i is the action space of the i-th agent, and A = A_1 × ⋯ × A_n is called the joint action space;
- P(s′ | s, a) is the transition probability to each state s′ ∈ S given a starting state s ∈ S and a joint action a ∈ A;
- R_i(s, a, s′) is the reward function of the i-th agent, representing the instantaneous reward received for the transition from (s, a) to s′;
- γ ∈ [0, 1] is the discount factor (a minimal step function for such a game is sketched after this list).
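As a concrete counterpart, a minimal sketch of one transition of a two-agent Markov game follows; the numbers of agents, actions and states, and the randomly generated P and R tensors, are illustrative assumptions.

```python
import numpy as np

n_agents, n_actions, n_states = 2, 2, 3
rng = np.random.default_rng(0)

# P[s, a1, a2, s'] : transition probability under the joint action (a1, a2).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

# R[i, s, a1, a2] : reward of the i-th agent for the joint action taken in state s.
R = rng.normal(size=(n_agents, n_states, n_actions, n_actions))

def joint_step(s, joint_action):
    """One Markov-game transition: returns the next state and one reward per agent."""
    a1, a2 = joint_action
    s_next = rng.choice(n_states, p=P[s, a1, a2])
    return s_next, R[:, s, a1, a2]
```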
2.1.3. Partially-Observable Markov Decision Process
2.1.4. Dec-POMDP
- I is the set of n agents;
- S is the state space;
- A = A_1 × ⋯ × A_n is the joint action space;
- Ω = Ω_1 × ⋯ × Ω_n is the joint observation space, where Ω_i is the observation space of the i-th agent (a minimal observation-model sketch follows this list).
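The distinguishing ingredient of the Dec-POMDP is that each agent only receives a private observation of the hidden state; a minimal sketch of such an observation model, with illustrative sizes and a randomly generated observation function, is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_states, n_obs = 2, 3, 2

# O[i, s, o] = probability that agent i receives observation o
# when the hidden environment state is s.
O = rng.dirichlet(np.ones(n_obs), size=(n_agents, n_states))

def observe(s):
    """Each agent draws its own partial observation of the hidden state s."""
    return [rng.choice(n_obs, p=O[i, s]) for i in range(n_agents)]
```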
2.2. Single-Agent RL Algorithms
- The critic, which has the task of estimating the value function, typically using TD methods;
- The actor, which represents the parameterized policy and updates its action distribution in the direction "suggested" by the critic using a policy gradient (a minimal tabular actor–critic sketch follows this list).
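A minimal tabular sketch of this interplay between critic and actor is shown below; the problem sizes, learning rates and softmax parameterization are illustrative assumptions, not the specific algorithm of any reference.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha_v, alpha_pi, gamma = 0.1, 0.01, 0.99

V = np.zeros(n_states)                    # critic: state-value estimates
theta = np.zeros((n_states, n_actions))   # actor: policy parameters (softmax logits)

def policy(s):
    """Softmax policy over the actor's logits for state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def actor_critic_update(s, a, r, s_next):
    # Critic: temporal-difference error and value update.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td_error
    # Actor: policy-gradient step in the direction "suggested" by the critic.
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0                 # gradient of log softmax w.r.t. the logits
    theta[s] += alpha_pi * td_error * grad_log_pi
```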
2.2.1. Q-Learning
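A minimal sketch of the tabular Q-learning update with ε-greedy exploration is given below; the learning rate, exploration rate and problem sizes are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng()

def select_action(s):
    """Epsilon-greedy action selection over the current Q estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```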
2.2.2. REINFORCE
- Initialize the policy parameters θ at random;
- Use π_θ to generate a trajectory, which is a sequence of states, actions and rewards, s_0, a_0, r_1, s_1, a_1, …, r_T;
- For each time step t = 0, 1, …, T − 1:
- Estimate the return G_t;
- Update the policy parameters θ using Equation (12);
- Iterate the process (a minimal sketch of this procedure follows).
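Under the assumption of a tabular softmax policy and a simple episodic environment exposing reset() and step() (both hypothetical placeholders), the procedure above can be sketched as follows; the final line stands in for the gradient-ascent update of Equation (12).

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.01, 0.99
theta = np.zeros((n_states, n_actions))   # policy parameters (softmax logits)
rng = np.random.default_rng()

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_episode(env):
    # Generate a trajectory (s_0, a_0, r_1, s_1, ...) with the current policy.
    trajectory, s, done = [], env.reset(), False
    while not done:
        a = rng.choice(n_actions, p=policy(s))
        s_next, r, done = env.step(a)     # assumed interface: (next state, reward, done)
        trajectory.append((s, a, r))
        s = s_next
    # For each time step, estimate the return G_t and update the parameters.
    G = 0.0
    for t in reversed(range(len(trajectory))):
        s, a, r = trajectory[t]
        G = r + gamma * G
        grad_log_pi = -policy(s)
        grad_log_pi[a] += 1.0
        theta[s] += alpha * (gamma ** t) * G * grad_log_pi
```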
2.2.3. A3C
Algorithm 1: A3C pseudocode [20].
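Since the original pseudocode box did not survive extraction, the sketch below shows only the n-step advantage actor–critic update performed by a single A3C worker, in a simplified, single-threaded tabular form; the asynchronous multi-worker machinery of [20] is omitted, and the environment interface and all constants are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha_pi, alpha_v, gamma, t_max = 0.01, 0.1, 0.99, 5
theta = np.zeros((n_states, n_actions))   # actor parameters (softmax logits)
V = np.zeros(n_states)                    # critic estimates
rng = np.random.default_rng()

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def a3c_style_update(env, s):
    """Collect at most t_max steps, bootstrap the return, then update actor and critic."""
    rollout, done = [], False
    for _ in range(t_max):
        a = rng.choice(n_actions, p=policy(s))
        s_next, r, done = env.step(a)     # assumed interface: (next state, reward, done)
        rollout.append((s, a, r))
        s = s_next
        if done:
            break
    R = 0.0 if done else V[s]             # bootstrap from the critic's value estimate
    for (s_t, a_t, r_t) in reversed(rollout):
        R = r_t + gamma * R
        advantage = R - V[s_t]
        grad_log_pi = -policy(s_t)
        grad_log_pi[a_t] += 1.0
        theta[s_t] += alpha_pi * advantage * grad_log_pi
        V[s_t] += alpha_v * advantage
    return s, done
```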
3. The Limits of Multi-Agent Reinforcement Learning
3.1. Nonstationarity
Varying Learning Speed
3.2. Scalability
Deep Reinforcement Learning
3.3. Partial Observability
3.3.1. Centralized Learning of Decentralized Policies
Algorithm 2: COMA pseudocode [39].
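As with Algorithm 1, the pseudocode box itself was lost in extraction; the sketch below illustrates only the counterfactual baseline that COMA [39] uses to assign credit to an individual agent, with made-up Q-values and policy probabilities, and is not the authors' full algorithm.

```python
import numpy as np

def counterfactual_advantage(q_values, pi, chosen_action):
    """
    q_values      : centralized critic's Q(s, (u^-a, u'_a)) for every alternative
                    action u'_a of agent a, the other agents' actions being fixed.
    pi            : agent a's current policy over its own actions.
    chosen_action : the action u_a that agent a actually took.
    """
    baseline = float(np.dot(pi, q_values))     # expected Q under the agent's own policy
    return q_values[chosen_action] - baseline  # counterfactual advantage for agent a

# Example with made-up numbers:
adv = counterfactual_advantage(np.array([1.0, 2.0, 0.5]),
                               np.array([0.2, 0.5, 0.3]),
                               chosen_action=1)
```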
3.3.2. Communications between Agents
4. Benchmark Environments for Multi-Agent Systems
5. Applications
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ML | Machine Learning |
RL | Reinforcement Learning |
MAS | Multi-Agent System |
MARL | Multi-Agent Reinforcement Learning |
MDP | Markov Decision Process |
MG | Markov Game |
POMDP | Partially-Observable Markov Decision Process |
Dec-POMDP | Decentralized Partially Observable Markov Decision Process |
TD | Temporal Difference |
DQN | Deep Q-Network |
IL | Independent Learner |
JAL | Joint Action Learner |
Q-RTS | Q Learning Real Time Swarm |
LMRL | Lenient Multi-agent Reinforcement Learning |
DRL | Deep Reinforcement Learning |
cDQN | Contextual DQN |
LDQN | Lenient Deep Q-Network |
DRQN | Deep Recurrent Q-Network |
LSTM | Long Short Term Memory |
Dec-HDRQN | Decentralized Hysteretic Deep Recurrent Q-Network |
CERT | Concurrent Experience Replay Trajectories |
TRPO | Trust Region Policy Optimization |
PS-TRPO | Parameter Sharing Trust Region Policy Optimization |
COMA | Counterfactual Multi-Agent |
A3C | Asynchronous Advantage Actor-Critic |
cA2C | Contextual Asynchronous Advantage Actor-Critic |
VDN | Value Decomposition Network |
DDRQN | Distributed Deep Recurrent Q-Network |
MA | Multi-Agent |
MADQN | Multi-Agent Deep Q-Network |
SC2LE | StarCraft II Learning Environment |
UAV | Unmanned Aerial Vehicle |
FOV | Field of View |
MUTAPP | Multi-UAV Target Assignment and Path Planning |
MADDPG | Multi Agent Deep Deterministic Policy Gradient |
LoS | Line of Sight |
nZEC | Nearly Zero Energy Community |
CMS | Community Monitoring Service |
ES-MARL | Equilibrium Selection Multi Agent Reinforcement Learning |
RG | Renewable Generator |
EV | Electric Vehicle |
V2G | Vehicle to Grid |
PPO | Proximal Policy Optimization |
RCS | Reactive Control Strategy |
PCS | Predictive Control Strategy |
GVM | Gross Volume of Merchandise |
IDQN | Independent Deep Q-Network |
CPR | Common Pool Resources |
SSD | Sequential Social Dilemma |
BS | Base Station |
References
- Yang, H.; Liu, X.Y.; Zhong, S.; Walid, A. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy. SSRN 2020. [Google Scholar] [CrossRef]
- Abbeel, P.; Darrell, T.; Finn, C.; Levine, S. End-to-End Training of Deep Visuomotor Policies. J. Mach. Learn. Res. 2016, 17, 1334–1373. [Google Scholar]
- Konar, A.; Chakraborty, I.G.; Singh, S.J.; Jain, L.C.; Nagar, A.K. A deterministic improved q-learning for path planning of a mobile robot. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2013, 43. [Google Scholar] [CrossRef] [Green Version]
- Lin, J.L.; Hwang, K.S.; Jiang, W.C.; Chen, Y.J. Gait Balance and Acceleration of a Biped Robot Based on Q-Learning. IEEE Access 2016, 4. [Google Scholar] [CrossRef]
- Panagiaris, N.; Hart, E.; Gkatzia, D. Generating unambiguous and diverse referring expressions. Comput. Speech Lang. 2021, 68. [Google Scholar] [CrossRef]
- Matta, M.; Cardarilli, G.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Nannarelli, A.; Re, M.; Spanò, S. A reinforcement learning-based QAM/PSK symbol synchronizer. IEEE Access 2019, 7. [Google Scholar] [CrossRef]
- Stone, P.; Veloso, M. Multiagent systems: A survey from a machine learning perspective. Auton. Robots 2000, 8, 345–383. [Google Scholar] [CrossRef]
- Zhuang, Y.; Hu, Y.; Wang, H. Scalability of Multiagent Reinforcement Learning. In Interactions in Multiagent Systems; Chapter 1; World Scientific: Singapore, 2000; pp. 1–17. [Google Scholar] [CrossRef]
- Thorndike, E.L. Animal Intelligence: An experimental study of the associative processes in animals. Am. Psychol. 1998, 58, 1125–1127. [Google Scholar] [CrossRef]
- Monahan, G.E. A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms. Manag. Sci. 1982, 28, 1–16. [Google Scholar] [CrossRef] [Green Version]
- Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The Complexity of Decentralized Control of Markov Decision Processes. Math. Oper. Res. 2002, 27, 819–840. [Google Scholar] [CrossRef]
- Sutton, R.S. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In Machine Learning Proceedings; Morgan Kaufmann: Burlington, MA, USA, 1990; pp. 216–224. [Google Scholar] [CrossRef] [Green Version]
- Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef]
- Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef] [Green Version]
- Watkins, C. Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1989. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
- Sutton, R. Learning to Predict by the Method of Temporal Differences. Mach. Learn. 1988, 3, 9–44. [Google Scholar] [CrossRef]
- Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
- Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef] [Green Version]
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1928–1937. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
- Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA, USA, 27–29 June 1993; pp. 330–337. [Google Scholar]
- Claus, C.; Boutilier, C. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI, USA, 26–30 July 1998; pp. 746–752. [Google Scholar]
- Matta, M.; Cardarilli, G.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Silvestri, F.; Spanò, S. Q-RTS: A real-time swarm intelligence based on multi-agent Q-learning. Electron. Lett. 2019, 55, 589–591. [Google Scholar] [CrossRef]
- Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Matta, M.; Nannarelli, A.; Re, M.; Spanò, S. FPGA Implementation of Q-RTS for Real-Time Swarm Intelligence systems. In Proceedings of the 2020 Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 1–4 November 2020. [Google Scholar]
- Matignon, L.; Laurent, G.J.; Le Fort-Piat, N. Hysteretic Q-learning: An algorithm for Decentralized Reinforcement Learning in Cooperative Multi-Agent Teams. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 29 October–2 November 2007; pp. 64–69. [Google Scholar] [CrossRef] [Green Version]
- Bloembergen, D.; Kaisers, M.; Tuyls, K. Lenient Frequency Adjusted Q-Learning; University of Luxemburg: Luxembourg, 2010; pp. 19–26. [Google Scholar]
- Kar, S.; Moura, J.M.F.; Poor, H.V. QD-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus + Innovations. IEEE Trans. Signal Process. 2013, 61, 1848–1862. [Google Scholar] [CrossRef] [Green Version]
- Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; Başar, T. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents; ML Research Press: Maastricht, The Netherlands, 2018. [Google Scholar]
- Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2016, arXiv:1509.02971. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML’15, Lille, France, 6–11 July 2015; Volume 37, pp. 448–456. [Google Scholar] [CrossRef]
- Palmer, G.; Tuyls, K.; Bloembergen, D.; Savani, R. Lenient Multi-Agent Deep Reinforcement Learning. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, Stockholm, Sweden, 10–15 July 2018; pp. 443–451. [Google Scholar]
- Foerster, J.; Nardelli, N.; Farquhar, G.; Torr, P.; Kohli, P.; Whiteson, S. Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
- Hausknecht, M.; Stone, P. Deep Recurrent Q-Learning for Partially Observable MDPs. In Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (AAAI-SDMIA15), Arlington, VA, USA, 12–14 November 2015. [Google Scholar]
- Omidshafiei, S.; Pazis, J.; Amato, C.; How, J.P.; Vian, J. Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; PMLR, International Convention Centre: Sydney, Australia, 2017; Volume 70, pp. 2681–2690. [Google Scholar]
- Gupta, J.K.; Egorov, M.; Kochenderfer, M. Cooperative Multi-agent Control Using Deep Reinforcement Learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, São Paulo, Brazil, 8–12 May 2017; Volume 10642. [Google Scholar] [CrossRef]
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 1889–1897. [Google Scholar]
- Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML’09, Montreal, QC, Canada, 14–18 June 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 41–48. [Google Scholar] [CrossRef]
- Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. In Proceedings of the AAAI 2018, Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based on Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and Multiagent Systems, AAMAS’18, Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087. [Google Scholar] [CrossRef]
- Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4295–4304. [Google Scholar]
- Sukhbaatar, S.; Szlam, A.; Fergus, R. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, Proceedings of the 30th Annual Conference on Neural Information Processing Systems, NIPS 2016, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2252–2260. [Google Scholar]
- Foerster, J.N.; Assael, Y.M.; de Freitas, N.; Whiteson, S. Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks. arXiv 2016, arXiv:1602.02672. [Google Scholar]
- Sukhbaatar, S.; Szlam, A.; Synnaeve, G.; Chintala, S.; Fergus, R. MazeBase: A Sandbox for Learning from Games; Cornell University Library: Ithaca, NY, USA, 2016. [Google Scholar]
- Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Sasha Vezhnevets, A.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; et al. StarCraft II: A New Challenge for Reinforcement Learning. arXiv 2017, arXiv:1708.04782. [Google Scholar]
- MuJoCo: Advanced Physics Simulation. Available online: http://www.mujoco.org (accessed on 10 April 2021).
- Pham, H.X.; La, H.M.; Feil-Seifer, D.; Nefian, A. Cooperative and Distributed Reinforcement Learning of Drones for Field Coverage. arXiv 2018, arXiv:1803.07250. [Google Scholar]
- Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on Multi-Agent Reinforcement Learning. IEEE Access 2019, 7, 146264–146272. [Google Scholar] [CrossRef]
- Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6382–6393. [Google Scholar]
- Cui, J.; Liu, Y.; Nallanathan, A. Multi-Agent Reinforcement Learning-Based Resource Allocation for UAV Networks. IEEE Trans. Wirel. Commun. 2020, 19, 729–743. [Google Scholar] [CrossRef] [Green Version]
- Gale, D.; Shapley, L.S. College Admissions and the Stability of Marriage. Am. Math. Mon. 1962, 69, 9–15. [Google Scholar] [CrossRef]
- Shamsoshoara, A.; Khaledi, M.; Afghah, F.; Razi, A.; Ashdown, J. Distributed Cooperative Spectrum Sharing in UAV Networks Using Multi-Agent Reinforcement Learning. In Proceedings of the 2019 16th IEEE Annual Consumer Communications Networking Conference (CCNC), Las Vegas, NV, USA, 11–14 January 2019; pp. 1–6. [Google Scholar] [CrossRef] [Green Version]
- Jung, S.; Yun, W.J.; Kim, J.; Kim, J.H. Coordinated Multi-Agent Deep Reinforcement Learning for Energy-Aware UAV-Based Big-Data Platforms. Electronics 2021, 10, 543. [Google Scholar] [CrossRef]
- Kong, X.; Xin, B.; Wang, Y.; Hua, G. Collaborative Deep Reinforcement Learning for Joint Object Search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Prasad, A.; Dusparic, I. Multi-agent Deep Reinforcement Learning for Zero Energy Communities. In Proceedings of the 2019 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe), Bucharest, Romania, 29 September–2 October 2019; pp. 1–5. [Google Scholar] [CrossRef] [Green Version]
- Fang, X.; Wang, J.; Song, G.; Han, Y.; Zhao, Q.; Cao, Z. Multi-Agent Reinforcement Learning Approach for Residential Microgrid Energy Scheduling. Energies 2020, 13, 123. [Google Scholar] [CrossRef] [Green Version]
- Hu, J.; Wellman, M.P. Nash Q-Learning for General-Sum Stochastic Games. J. Mach. Learn. Res. 2003, 4, 1039–1069. [Google Scholar]
- Roesch, M.; Linder, C.; Zimmermann, R.; Rudolf, A.; Hohmann, A.; Reinhart, G. Smart Grid for Industry Using Multi-Agent Reinforcement Learning. Appl. Sci. 2020, 10, 6900. [Google Scholar] [CrossRef]
- Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv 2016, arXiv:1610.03295. [Google Scholar]
- Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef] [Green Version]
- Lin, K.; Zhao, R.; Xu, Z.; Zhou, J. Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1774–1783. [Google Scholar]
- Calvo, A.; Dusparic, I. Heterogeneous Multi-Agent Deep Reinforcement Learning for Traffic Lights Control. In Proceedings of the 26th Irish Conference on Artificial Intelligence and Cognitive Science, Dublin, Ireland, 6–7 December 2018. [Google Scholar]
- Krajzewicz, D.; Erdmann, J.; Behrisch, M.; Bieker, L. Recent Development and Applications of SUMO –Simulation of Urban Mobility. Int. J. Adv. Syst. Meas. 2012, 5, 128–138. [Google Scholar]
- Sui, Y.; Song, S. A Multi-Agent Reinforcement Learning Framework for Lithium-ion Battery Scheduling Problems. Energies 2020, 13, 1982. [Google Scholar] [CrossRef] [Green Version]
- Kim, H.; Shin, K.G. Scheduling of Battery Charge, Discharge, and Rest. In Proceedings of the 2009 30th IEEE Real-Time Systems Symposium, Washington, DC, USA, 1–4 December 2009; pp. 13–22. [Google Scholar] [CrossRef] [Green Version]
- Perolat, J.; Leibo, J.Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; Graepel, T. A multi-agent reinforcement learning model of common-pool resource appropriation. arXiv 2017, arXiv:1707.06600. [Google Scholar]
- Leibo, J.Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; Graepel, T. Multi-Agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS’17, São Paulo, Brazil, 8–12 May 2017; International Foundation for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2017; pp. 464–473. [Google Scholar]
- Pandey, B. Adaptive Learning for Mobile Network Management. Master’s Thesis, Aalto University, Espoo, Finland, 12 December 2016. [Google Scholar]
Algorithm | Type of Agent | Learning Structure | Features | Scientific Spreading |
---|---|---|---|---|
Hysteretic Q-Learning [26] | Value based | Independent learners | Uses a different learning rate for increasing and decreasing Q-values. No need for communication between agents | 1100 |
Lenient Q-Learning [27] | Value based | Independent learners | Accumulates rewards for a state–action pair and then updates the Q-value using the maximum reward | 333 |
QD-learning [28] | Value based | Networked agents | Receives the Q-values from agents in the proximity with the objective of minimizing the difference and reaching consensus | 104 |
Actor–Critic with Networked Agents [29] | Actor–critic | Networked agents | Both the policy of the actor and the Q-estimates of the critic are parameterized. The agents share the parameters of their critic to reach consensus | 3050 |
Lenient deep Q-network [32] | Value based | Independent agents | Stores a temperature variable in the experience replay to decide the amount of leniency to apply to the updates | 97 |
Multi-Agent Deep Q-Network [33] | Value based | Independent agents | Use of importance sampling and low-dimensional fingerprints to disambiguate samples in the experience replay | 3380 |
Dec-HDRQN [35] | Value based | Independent agents | Integrates a recurrent neural network to estimate the unobserved state and hysteretic Q-learning to address nonstationarity. Transfer learning can be used to adapt to multi-task applications | 49 |
PS-TRPO [36] | Policy optimization | Centralized training | Shares parameters between agents during training. The policy parameters are bounded to change in a trust region. Can scale progressively to more agents using curriculum learning | 23 |
COMA [39] | Actor–critic | Centralized training | A centralized critic is used only during the training phase. Can differentiate rewards between agents using a counterfactual baseline | 4750 |
VDN [40] and QMIX [41] | Value based | Networked agents | The Q-table of the joint action can be factorized as a sum or a combination of independent Q-Tables | 847 |
CommNet [42] | Policy optimization | Networked agents | The agents communicate for a number of rounds before selecting their action. The communication protocol is learned concurrently with the optimal policy | 319 |
DDRQN [43] | Value based | Networked agents | Uses a deep recurrent network architecture in a partially observable setting, with parameter sharing to speed up the learning process | 3510 |
Q-RTS [24] | Value based | Centralized training | Agents create a global knowledge Q-matrix combining their most valuable experiences and make updates on a linear combination of the matrix and their local Q-table. | 108 |
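To make the first row of the table above concrete, a hedged sketch of the hysteretic Q-learning update [26] is given below: increases of a Q-value use a larger learning rate than decreases, which keeps each independent learner optimistic about its teammates' exploration. The rates and problem sizes are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, beta, gamma = 0.1, 0.01, 0.99      # alpha > beta: optimism about teammates
Q = np.zeros((n_states, n_actions))

def hysteretic_update(s, a, r, s_next):
    """Apply the larger rate when the estimate increases, the smaller one when it decreases."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    lr = alpha if delta >= 0 else beta
    Q[s, a] += lr * delta
```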
Sector | Applications |
---|---|
UAVs | Drone field coverage [47]; Target assignment and path planning [48]; LoS networks [50]; Packet routing relaying [52]; Recharging towers [53] |
Image processing | Joint active objective search [54] |
Energy sharing and scheduling | Zero-energy community [55]; Residential microgrid with V2G exchange [56]; Industry production control [58]; Lithium battery lifetime optimization [64] |
Automotive | Safe driving [59]; Fleet control for ride-sharing platform [61]; Intersection traffic light control [62] |
Social Science | Common-pool resource appropriation [66]; Sequential social dilemmas [67] |
Networking | Base-station parameter approximation [68] |