Multi-Agent Reinforcement Learning: A Review of Challenges and Applications
1. Introduction
2. Background
2.1. Multi-Agent Framework
2.1.1. Markov Decision Process
- $S$ is the state space;
- $A$ is the action space;
- $P: S \times A \to \Delta(S)$ is the transition probability from state $s \in S$ to $s' \in S$ given the action $a \in A$;
- $R: S \times A \times S \to \mathbb{R}$ is the reward function, whose value is the reward received by the agent for a transition from the state–action pair $(s, a)$ to the state $s'$;
- $\gamma \in [0, 1]$ is the discount factor, a parameter that weights instantaneous rewards against future rewards.
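To illustrate the role of the discount factor, the following minimal sketch computes the discounted return $\sum_k \gamma^k r_k$ of a reward sequence. The function name `discounted_return` and the sample rewards are assumptions made for this example only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical reward sequence
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With a discount factor of 0 only the first reward counts, while values close to 1 weight future rewards almost as much as the immediate one.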
2.1.2. Markov Game
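- $\mathcal{N} = \{1, \dots, N\}$ is the set of agents;
- $S$ is the state space observed by all agents;
- $A^i$ is the action space of the $i$-th agent and $A = A^1 \times \dots \times A^N$ is called the joint action space;
- $P: S \times A \to \Delta(S)$ is the transition probability to each state $s' \in S$ given a starting state $s \in S$ and a joint action $a \in A$;
- $R^i: S \times A \times S \to \mathbb{R}$ is the reward function of the $i$-th agent, representing the instantaneous reward received when transitioning from $(s, a)$ to $s'$;
- $\gamma \in [0, 1]$ is the discount factor.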
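The components listed above can be made concrete with a small sketch of one environment step in a two-agent Markov game. The game (a matching-pennies variant), the state names and the helpers `transition`, `reward` and `step` are all hypothetical and chosen only for illustration.

```python
import random

AGENTS = (0, 1)
ACTIONS = ("heads", "tails")  # assumed: both agents share the same action space


def transition(state, joint_action):
    """P(. | s, a): map each reachable next state to its probability."""
    if joint_action[0] == joint_action[1]:
        return {"match": 1.0}
    return {"mismatch": 1.0}


def reward(agent, state, joint_action, next_state):
    """R^i(s, a, s'): agent 0 wins on a match, agent 1 on a mismatch."""
    sign = 1.0 if next_state == "match" else -1.0
    return sign if agent == 0 else -sign


def step(state, joint_action, rng):
    """Sample the next state from P and return one reward per agent."""
    dist = transition(state, joint_action)
    next_state = rng.choices(list(dist), weights=list(dist.values()))[0]
    rewards = tuple(reward(i, state, joint_action, next_state) for i in AGENTS)
    return next_state, rewards


rng = random.Random(0)
next_state, rewards = step("start", ("heads", "tails"), rng)
print(next_state, rewards)  # mismatch (-1.0, 1.0)
```

The transition and the rewards depend on the joint action, so from the point of view of a single agent the outcome of its own action changes with the behaviour of the other agent.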
2.1.3. A Partially-Observable Markov Decision Process
2.1.4. Dec-POMDP
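- $I$ is the set of $n$ agents;
- $S$ is the state space;
- $A = \times_{i \in I} A^i$ is the joint action space;
- $O = \times_{i \in I} O^i$ is the joint observation space, with $O^i$ the observation space of agent $i$.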
2.2. Single-Agent RL Algorithms
- The critic, which has the task of estimating the value function, typically using TD methods;
- The actor, which represents the parameterized policy and updates its action distribution in the direction "suggested" by the critic using a policy gradient.
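The interplay between the two components can be sketched with a one-step tabular update. This is a minimal illustration, assuming a softmax policy over per-state action preferences `theta` and a state-value table `v`; the function names, step sizes and data layout are assumptions, not taken from the source.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_update(theta, v, s, a, r, s_next, done,
                        alpha_actor=0.1, alpha_critic=0.1, gamma=0.9):
    """Apply one actor-critic step in place and return the TD error."""
    # Critic: TD(0) error of the state-value estimate.
    target = r if done else r + gamma * v[s_next]
    td_error = target - v[s]
    v[s] += alpha_critic * td_error
    # Actor: policy-gradient step scaled by the critic's TD error.
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * td_error * grad_log
    return td_error


theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # hypothetical two-state, two-action problem
v = {0: 0.0, 1: 0.0}
delta = actor_critic_update(theta, v, s=0, a=1, r=1.0, s_next=1, done=True)
print(delta, v[0], theta[0])
```

A positive TD error raises the preference of the action that was taken and lowers the others, which is the sense in which the critic "suggests" a direction to the actor.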
2.2.1. Q-Learning
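A minimal sketch of the standard tabular Q-learning update, $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$, is given here. The function name, the dictionary-based Q-table and the numeric values are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, done, alpha=0.5, gamma=0.9):
    """Update q[(s, a)] in place with the Q-learning rule and return the new value."""
    # Bootstrap from the greedy value of the next state, unless the episode ended.
    best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
    td_error = r + gamma * best_next - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return q[(s, a)]


q = defaultdict(float)      # unseen state-action pairs default to 0
q[(1, "right")] = 2.0       # hypothetical existing estimate
new_value = q_learning_update(q, 0, "right", 1.0, 1, ("left", "right"), False)
print(new_value)  # 0.5 * (1.0 + 0.9 * 2.0)
```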
2.2.2. REINFORCE
- Initialize the policy parameters $\theta$ at random;
- Use $\pi_\theta$ to generate a trajectory, which is a sequence of states, actions and rewards, $s_0, a_0, r_1, \dots, s_{T-1}, a_{T-1}, r_T$;
- For each time-step $t = 0, 1, \dots, T-1$:
  - Estimate the return $G_t$;
  - Update the policy parameters using Equation (12);
- Iterate the process.
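The steps listed above (generate a trajectory, estimate the return at each time-step, update the policy parameters) can be sketched as follows. The sketch uses the textbook update $\theta \leftarrow \theta + \alpha \, G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ with a tabular softmax policy; whether this matches the referenced Equation (12) term by term is an assumption, and all function names are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def returns(rewards, gamma):
    """Return the list of discounted returns G_t, one per time-step."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def reinforce_update(theta, trajectory, alpha=0.1, gamma=0.9):
    """Update theta in place from one trajectory of (state, action, reward) tuples."""
    gs = returns([r for _, _, r in trajectory], gamma)
    for (s, a, _), g in zip(trajectory, gs):
        probs = softmax(theta[s])
        for b in range(len(theta[s])):
            # Gradient of log softmax: indicator of the taken action minus its probability.
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * g * grad_log
    return gs


theta = {0: [0.0, 0.0]}          # one state, two actions (hypothetical)
trajectory = [(0, 1, 1.0)]       # took action 1 in state 0 and received reward 1
print(reinforce_update(theta, trajectory), theta)
```

In a full training loop the trajectory would be regenerated with the updated policy after every call, which corresponds to the final "iterate" step.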
2.2.3. A3C
Algorithm 1. A3C pseudocode [20].
3. The Limits of Multi-Agent Reinforcement Learning
3.1. Nonstationarity
Varying Learning Speed
3.2. Scalability
Deep Reinforcement Learning
3.3. Partial Observability
3.3.1. Centralized Learning of Decentralized Policies
Algorithm 2. COMA pseudocode [39].
3.3.2. Communications between Agents
4. Benchmark Environments for Multi-Agent Systems
5. Applications
6. Conclusions
Algorithm | Type of Agent | Learning Structure | Features | Scientific Spreading |
Hysteretic Q-Learning [26] | Value based | Independent learners | Uses a different learning rate for increasing and decreasing Q-values. No need for communication between agents | 1100 |
Lenient Q-Learning [27] | Value based | Independent learners | Accumulates rewards for a state–action pair and then updates it using the maximum reward | 333 |
QD-learning [28] | Value based | Networked agents | Receives the Q-values from agents in the proximity with the objective of minimizing the difference and reaching consensus | 104 |
Actor–Critic with Networked Agents [29] | Actor–critic | Networked agents | Both the policy of the actor and the Q-estimates of the critic are parameterized. The agents share the parameters of their critic to reach consensus | 3050 |
Lenient deep Q-network [32] | Value based | Independent agents | Stores a temperature variable in the experience replay to decide the amount of leniency to apply to the updates | 97 |
Multi-Agent Deep Q-Network [33] | Value based | Independent agents | Uses importance sampling and low-dimensional fingerprints to disambiguate samples in the experience replay | 3380 |
Dec-HDQRN [35] | Value based | Independent agents | Integrates a recurrent neural network to estimate the unobserved state and hysteretic Q-learning to address nonstationarity. Can use transfer learning to adapt to multi-task applications | 49 |
PS-TRPO [36] | Policy optimization | Centralized training | Shares parameters between agents during training. The policy parameters are bounded to change in a trust region. Can scale progressively to more agents using curriculum learning | 23 |
COMA [39] | Actor–critic | Centralized training | A centralized critic is used only during the training phase. Can differentiate rewards between agents using a counterfactual baseline | 4750 |
VDN [40] and QMIX [41] | Value based | Networked agents | The Q-table of the joint action can be factorized as a sum or a combination of independent Q-tables | 847 |
CommNet [42] | Policy optimization | Networked agents | The agents communicate for a number of rounds before selecting their action. The communication protocol is learned concurrently with the optimal policy | 319 |
DDRQN [43] | Value based | Networked agents | Uses a deep recurrent network architecture in a partially observable setting, with parameter sharing to speed up the learning process | 3510 |
Q-RTS [24] | Value based | Centralized training | Agents create a global knowledge Q-matrix combining their most valuable experiences and make updates on a linear combination of the matrix and their local Q-table | 108 |
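As an illustration of the first table entry, the sketch below shows a hysteretic Q-learning update that applies a larger learning rate when the Q-value increases and a smaller one when it decreases. It is a minimal tabular sketch; the function name, the rates `alpha` and `beta`, and the sample transitions are assumptions made for this example.

```python
def hysteretic_update(q, s, a, r, s_next, actions, done,
                      alpha=0.5, beta=0.05, gamma=0.9):
    """Update q[(s, a)] in place with two learning rates and return the new value."""
    best_next = 0.0 if done else max(q.get((s_next, b), 0.0) for b in actions)
    delta = r + gamma * best_next - q.get((s, a), 0.0)
    # Optimistic learner: increases use alpha, decreases use the smaller beta.
    rate = alpha if delta >= 0 else beta
    q[(s, a)] = q.get((s, a), 0.0) + rate * delta
    return q[(s, a)]


q = {}
up = hysteretic_update(q, 0, "x", 1.0, 0, ("x",), done=True)     # positive TD error
down = hysteretic_update(q, 0, "x", -1.0, 0, ("x",), done=True)  # negative TD error
print(up, down)
```

Because decreases are applied slowly, a low reward caused by another agent's exploratory action has less effect on the estimate than it would with a single learning rate.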
Sector | Applications |
UAVs | Drone field coverage [47]; Target assignment and path planning [48]; LoS Networks [50]; Packet routing relaying [52]; Recharging towers [53] |
Image processing | Joint active objective search [54] |
Energy sharing and scheduling | Zero-energy community [55]; Residential microgrid with V2G exchange [56]; Industry production control [58]; Lithium battery lifetime optimization [64] |
Automotive | Safe driving [59]; Fleet control for ride sharing platform [61]; Intersection traffic light control [62] |
Social Science | Common Pool Resource Approximation [66]; Sequential Social Dilemmas [67] |
Networking | Base-station parameter approximation [68] |
