
Multi-agent Deep Reinforcement Learning for cloud-based digital twins in power grid management

Abstract

As the power industry becomes increasingly complex, Digital Twin (DT) technology has emerged as a crucial tool for enhancing grid resilience and operational efficiency by creating dynamic digital replicas of physical systems. These replicas enable accurate simulations and proactive management, but the vast amount of data generated by DT systems poses significant challenges for processing and analysis. Cloud computing offers a flexible solution by offloading these computational tasks to distributed resources, allowing for real-time analysis and scalable operations. However, this approach introduces complexities in task distribution and maintaining quality of service. Recent efforts have applied Deep Reinforcement Learning (DRL) to address these challenges, primarily using single-agent methods, which struggle with scalability and performance in increasingly complex cloud environments. To overcome these limitations, we propose an efficient task scheduling framework based on Multi-Agent Deep Q-Network (MADQN) principles, specifically designed to optimize both response times and operational costs. We provide a comprehensive design overview of our approach and conduct a thorough evaluation of its performance. The experimental results clearly indicate that our approach can considerably reduce response times and lower operational costs compared to current methods, including state-of-the-art single-agent approaches.

Introduction

As the power industry evolves to meet growing challenges, technologies like smart grids and artificial intelligence have been adopted [1]. However, these approaches often struggle with integrating real-time data and predictive modeling. Digital Twin (DT) technology, which creates dynamic digital replicas of physical systems for accurate simulations and proactive grid management, is emerging as a key solution [2]. Unlike conventional tools, DT provides a dynamic, real-time simulation environment where various scenarios can be tested and optimized.

With the growing complexity of modern power grids, implementing DT technology presents significant challenges. As DT models interact with their physical systems, they generate vast amounts of data that require rapid processing and analysis to realize the full benefits of DT technology. Cloud computing, with its flexible resource utilization, offers a robust solution to managing the massive data streams produced by power grid DT systems [3]. As demonstrated in Fig. 1, in a cloud-based digital twin framework for power grids, tasks originating from both the power grid and the DT are delegated to the cloud for processing. By offloading computationally intensive tasks to the cloud, the power grid gains enhanced processing capabilities without the need for extensive on-site infrastructure. This approach not only enables real-time data analysis but also provides the scalability needed to accommodate growing data demands.

Fig. 1 Task processing in cloud computing for power grid digital twin [4]

To enhance the performance of task processing in cloud environments, it is crucial to allocate tasks to computing resources efficiently. Traditional algorithms, which have been primarily effective for batch processing, often struggle with the unpredictability and demands of real-time operations [5], rendering them unsuitable for DT scenarios. As artificial intelligence has advanced, researchers have explored more adaptive solutions. A significant approach is Deep Reinforcement Learning (DRL), which offers a distinct advantage by continuously learning from the environment to make real-time decisions [6]. This real-time adaptability sets DRL apart, making it a promising approach for managing the complexities of modern cloud computing environments where traditional methods may falter [7].

Currently, DRL is widely employed for addressing task scheduling challenges in cloud computing [8]. Moreover, the combination of DRL with DT technology has proven effective in various fields, including vehicular networks [9] and satellite edge networks [10], enhancing real-time decision-making and resource management. In the context of power grid management, although recent work has adopted DRL [4], it has typically implemented single-agent DRL models, which have notable limitations. Specifically, single-agent methods in distributed systems often struggle with scalability and efficiency, posing significant challenges in managing the diverse and simultaneous tasks generated by power grids.

To address these challenges, this work introduces an efficient scheduling method called GD-MA, designed to effectively manage the diverse array of tasks generated by power grids and their DT systems, thereby delivering timely outcomes that enhance decision-making processes. Specifically, GD-MA improves upon traditional Q-Learning and Deep Q-Networks (DQN) by addressing scalability challenges in real-time task scheduling for cloud-based DT systems. Using Multi-Agent Deep Q-Network (MADQN), GD-MA intelligently allocates tasks across multiple compute nodes, considering their specific characteristics and processing requirements. This multi-agent approach optimizes resource utilization and enhances response times, particularly in dynamic environments. By outperforming single-agent models in scalability and efficiency, GD-MA provides faster response times and more efficient resource utilization, making it a robust and cost-effective solution for power grid management.

In summary, the key contributions of our work are as follows:

  • We introduce a task scheduling approach called GD-MA, leveraging Multi-Agent Deep Reinforcement Learning (MADRL), to minimize response times and reduce overall computational costs for cloud-based digital twins in power grid management.

  • We provide a comprehensive design of GD-MA, featuring an in-depth mathematical model and a detailed implementation of our approach.

  • We compare our method with existing approaches, and our experimental results demonstrate that GD-MA outperforms other methods, including the state-of-the-art single-agent approach.

The remainder of this paper is structured as follows. First, we review relevant literature. We outline the proposed system architecture. We then detail our MADRL approach and assess its performance through experimental evaluation. Finally, we conclude by summarizing our findings.

Related work

Digital Twin (DT) technology has become a cornerstone in modern industry, offering the capability to create detailed digital replicas of physical systems for a variety of purposes, including simulation, analysis, and optimization [11]. Originally focused on enhancing smart manufacturing by merging digital and physical realms, DT technology has now expanded its reach to areas like autonomous vehicles [12] and modular manufacturing systems [13], where it plays a key role in improving efficiency and reducing processing times. For instance, DT models have been pivotal in refining offloading strategies and optimizing subchannel allocation, thereby boosting computational performance in these sectors [14]. Similarly, DT frameworks have been applied to manage electrical devices, ensuring more reliable and efficient communication at the network edge [15]. In the realm of battery management, DT technology integrated with cloud platforms has drastically enhanced data processing capabilities and overall system efficiency [16]. In smart grids, DT has been utilized to predict the remaining operational life of equipment, underscoring its importance in maintaining grid stability [17]. Despite these advancements, the sheer volume of tasks generated by DT systems, particularly in power grids, presents a significant challenge. While local processing may struggle to keep pace, leveraging cloud computing for task management emerges as a superior approach. To fully harness the potential of DT in power grids, an effective cloud-based task scheduling system is crucial, as it ensures optimal resource use and enhances operational efficiency.

While task scheduling in cloud computing has seen considerable progress, most current strategies remain focused on optimizing batch processing, which leaves them ill-suited for the challenges of real-time workloads. For example, the Enhanced Multi-Verse Optimizer (EMVO) algorithm [18] is effective at minimizing makespan and improving resource utilization, but it excels mainly in static, predictable environments. Similarly, the reengineered Henry Gas Solubility Optimization (HGSO) algorithm, combining elements from the Whale Optimization Algorithm (WOA) and Comprehensive Opposition-Based Learning (COBL) [19], enhances scheduling efficiency, yet its application is limited when real-time adaptability is essential. The SAEA algorithm, based on fuzzy logic [20], addresses multi-objective challenges like energy efficiency, load balancing, and security, but is more suited to controlled, non-real-time scenarios. These methods, while effective in specific contexts, often lack the flexibility and responsiveness needed for dynamic, real-time task management. This highlights a critical gap in current cloud computing task scheduling approaches, emphasizing the need for new strategies capable of meeting the demands of real-time processing with the required speed and adaptability.

As artificial intelligence continues to advance, learning-based approaches are increasingly employed to tackle complex tasks. Among these approaches, the fusion of DT technology with Deep Reinforcement Learning (DRL) is rapidly emerging as a key factor in driving advancements across multiple sectors. For example, in vehicular networks, a DRL-based service offloading (SOL) approach has been developed to address the limitations of vehicular computing resources [21]. Moreover, in the domain of mobile edge computing, researchers have pioneered an advanced task offloading system that leverages DRL and DT technology [22]. In addition, DT methods are utilized to capture real-time data, while DRL strategies are implemented to enhance task scheduling by concurrently reducing latency and conserving energy in in-vehicle edge cloud settings [23]. In power grid management, the combination of DT and deep Q-learning has significantly enhanced performance [24]. The capabilities of Deep Neural Networks (DNN) for perception, combined with Reinforcement Learning (RL) for decision-making, effectively address various optimization challenges [25]. The work [4] proposes a real-time task scheduling method for power grid DT systems using DRL, to optimize processing time and cost in cloud computing environments. However, scaling Q-learning to manage high-dimensional tasks remains a substantial challenge, with DQN often encountering the issue of Q-value overestimation [26].

Single-agent DQN struggles in complex multi-agent environments due to non-stationarity and a lack of inter-agent coordination. In contrast, MADQN addresses these issues by using centralized training with decentralized execution to stabilize learning and by incorporating communication protocols for effective coordination, leading to improved performance in complex multi-agent scenarios. MADRL methods have also been effectively applied in various other fields. For example, in vehicle systems, MADRL has been applied to enhance coordinated driving and traffic optimization in multi-agent scenarios, improving both efficiency and safety [27]. In smart manufacturing, MADRL has been utilized to optimize task offloading in distributed manufacturing systems, reducing latency and improving coordination across multiple edge and cloud nodes [28]. Additionally, in wireless sensor networks, MADRL has been applied to optimize energy harvesting and data freshness in UAV-assisted systems, improving energy efficiency and task coordination [29]. In our paper, we utilize MADRL for cloud-based DT in power grid management.

Cloud-based power grid digital twin system

This section provides a detailed analysis of cloud-based task processing in power grid DT systems, explaining how tasks generated by the power grid and its DT are routed to cloud nodes for processing.

System framework

Figure 2 provides a detailed depiction of how task scheduling is managed within cloud nodes for a power grid digital twin system. Typically, when a computing job from the DT system arrives, the scheduler assigns it to a suitable instance, where it is then placed in the queue. The task waits in line within the computing node’s queue until it is executed. Once the task is completed, the node is ready to receive a new assignment. To comprehensively tackle the optimization challenges examined, we establish mathematical constructs concerning the task, computational node, and task execution processes. The corresponding symbols and terminologies are detailed in Table 1.

Fig. 2 Task processing in cloud for power grid digital twin systems

Table 1 Notations used in the proposed approach

System model

Task Model. Unlike other DT applications, the power grid system derives its data from a multitude of varied origins, including sensors, measurement devices, and monitoring systems. This assortment necessitates focused attention on the unique types of tasks arising from the power grid and DT systems, including data interpretation, log oversight, and image analysis. Moreover, the constantly shifting dynamics of the power grid demand unwavering real-time responsiveness: tasks must promptly detect and adjust to variations in the grid’s conditions, thereby enabling immediate monitoring, forecasting, and informed decision-making. As a result, tasks can be uploaded at any time, with each assigned a specific QoS time requirement and appropriately characterized. Similar to [4], a task can be modeled as: \(Task_{i}=\{K^{ID}_{i}, K^{AT}_{i}, K^{S}_{i}, K^{QoS}_{i}, K^{T}_{i}\}\). Here, \(K^{ID}_{i}\) represents the task ID, \(K^{AT}_{i}\) denotes the task upload time, \(K^{S}_{i}\) indicates the task size, \(K^{QoS}_{i}\) specifies the QoS requirement, and \(K^{T}_{i}\) identifies the task type.

Computing Nodes Model. Within cloud platforms, computing nodes function as the primary computational building blocks, granting users the capacity to lease and customize resources in alignment with their individual requirements. To resolve the issue at hand, we adopt a pay-as-you-go approach. The distinct features of each available node are detailed as: \(Node_{j}=\{D^{ID}_{j}, D^{T}_{j}, D^{C}_{j}, D^{P}_{j}, D^{I}_{j}\}\). In this context, \(D^{ID}_{j}\) represents the computing node ID, while \(D^{T}_{j}\) denotes the computing node type. The idle time of the node is indicated by \(D^{I}_{j}\), and \(D^{C}_{j}\) specifies its processing capacity. Additionally, \(D^{P}_{j}\) denotes the price for processing, which correlates with execution time.
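For illustration, the task and node models above can be captured as plain data structures. The following is a minimal sketch; the field names simply mirror the notation of the two models and are not taken from the authors’ implementation.

    from dataclasses import dataclass

    @dataclass
    class Task:
        """Task_i = {K^ID_i, K^AT_i, K^S_i, K^QoS_i, K^T_i} from the task model."""
        task_id: int          # K^ID_i
        arrival_time: float   # K^AT_i, upload time
        size: float           # K^S_i
        qos: float            # K^QoS_i, maximum acceptable response time
        task_type: str        # K^T_i, e.g. "data_analysis", "log", "image"

    @dataclass
    class Node:
        """Node_j = {D^ID_j, D^T_j, D^C_j, D^P_j, D^I_j} from the computing nodes model."""
        node_id: int          # D^ID_j
        node_type: str        # D^T_j
        capacity: float       # D^C_j, processing capacity
        price: float          # D^P_j, price per unit of execution time
        idle_time: float      # D^I_j, time at which the node next becomes idle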

Scheduling Model. We propose that upon the generation of a task by the power grid and DT system, the scheduler promptly dispatches it to a designated node. After assignment, the task is enqueued in the processing lineup of the designated node, where it waits alongside other tasks already positioned for execution. This queue operates on a first-come, first-served (FCFS) basis, ensuring that tasks are processed in the order they arrive. It is further assumed that once a task begins its processing, it cannot be interrupted by any other tasks, maintaining a non-preemptive execution environment. Furthermore, there is no cap on the number of tasks that can be stored in the queue of a node. The response time for task completion is determined by summing the time the task spends waiting in the compute node’s queue and the total time required for processing. As a result, the response time \(T_{i}\) is represented as:

$$\begin{aligned} T_{i}=T^{wait}_{i}+T^{exe}_{i} \end{aligned}$$
(1)

Here, \(T^{exe}_{i}\) represents the task’s processing time, while \(T^{wait}_{i}\) indicates the waiting time before the task is processed. The execution time \(T^{exe}_{i}\) is defined as: \(T^{exe}_{i}=\beta *\frac{K^{S}_{i}}{D^C_{j}}\). In this context, \(K^{S}_{i}\) represents the size of the task, \(D^{C}_{j}\) denotes the processing capacity of node j, and \(\beta\) signifies the speedup ratio for different tasks executed across various computing nodes.

When a task arrives at its designated computing node, it is executed immediately if the node is idle. However, if the node is already engaged with another task, the new task must wait until the current one is completed before it can begin processing. The waiting period, denoted as \(T^{wait}_{i}\), represents the time task i must wait before processing. Here, \(D^{I}_{j}\) represents the node’s idle time, while \(K^{AT}_{i}\) is the task’s upload time. If the node is occupied, the waiting period is calculated as the difference between the node’s idle time and the task’s upload time, given by \(T^{wait}_{i} = D^I_{j} - K^{AT}_{i}\). Conversely, if the node is already idle, the task can be processed immediately, resulting in no waiting time, hence \(T^{wait}_{i} = 0\).

Following the definition in the latest work [4], a task is considered successfully processed if it satisfies the QoS criteria. The specific conditions for successful task execution are defined as:

$$\begin{aligned} success=\left\{ \begin{array}{cc} 1 & if\ T_i < K^{QoS}_{i}\\ 0 & else \end{array}\right. \end{aligned}$$
(2)

Here, \(T_{i}\) refers to response time, while \(K^{QoS}_{i}\) indicates the QoS requirement. The expense of processing a task is largely determined by how long it takes to execute. By minimizing the execution time, the cost associated with processing each task can be significantly reduced. The relationship between execution time and processing cost is detailed as follows.

$$\begin{aligned} cost_{i}=D^{P}_{j}*T^{exe}_i \end{aligned}$$
(3)

In this context, \(cost_{i}\) represents the task’s processing cost, \(D^{P}_{j}\) denotes the price of the node, and \(T^{exe}_{i}\) indicates the time needed to complete the task.
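Putting Eqs. (1)-(3) together, the outcome of assigning a task to a node can be evaluated in a few lines. This sketch reuses the Task and Node structures introduced above and assumes \(\beta\) is supplied for the task/node pair; it illustrates the model rather than reproducing the authors’ code.

    def evaluate_assignment(task: Task, node: Node, beta: float = 1.0):
        """Return (response_time, success, cost) for placing `task` on `node`."""
        exe_time = beta * task.size / node.capacity                # T^exe_i
        wait_time = max(node.idle_time - task.arrival_time, 0.0)   # T^wait_i (0 if the node is idle)
        response_time = wait_time + exe_time                       # Eq. (1)
        success = 1 if response_time < task.qos else 0             # Eq. (2): QoS satisfied?
        cost = node.price * exe_time                               # Eq. (3)
        return response_time, success, cost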

The proposed multi-agent approach

In this section, we provide a detailed explanation of the fundamentals of MADQN and elaborate on the specifics of the proposed approach. We also scrutinize the action space, state space, and reward function.

Basics of multi-agent Deep Reinforcement Learning

MDP. The Markov Decision Process (MDP) provides a fundamental framework for modeling decision-making in environments where outcomes are influenced by both random events and the choices made by a decision maker. MDPs are particularly effective in solving complex optimization problems, utilizing dynamic programming and reinforcement learning to determine optimal strategies. An MDP is represented by a 5-tuple \((S, A, P_a(s, s'), R_a(s, s'), \gamma )\), where S represents the possible states, A is the set of available actions, \(P_a(s, s')=Pr(s_{t+1}=s'|s_t=s, a_t=a)\) specifies the probability that action a taken in state \(s_t=s\) will result in state \(s_{t+1}=s'\), \(R_a(s, s')\) indicates the reward obtained after transitioning, and \(\gamma \in [0, 1]\) is the discount factor that prioritizes immediate rewards over future ones.

The fundamental goal of an MDP is to determine the most effective policy that optimizes the total reward over a series of decision-making steps. This cumulative reward, commonly known as the expected return, is calculated by summing the rewards obtained from each action, with future rewards discounted to reflect their reduced importance. In reinforcement learning, ensuring that the system adheres to the Markov property involves creating a state transition matrix that captures the likelihood of moving from one state to another.

DQN. Q-learning is a reinforcement learning approach that operates independently of a model and shares similarities with the MDP framework. It works by estimating the expected rewards for each action within a specific state, without requiring knowledge of future outcomes. The algorithm uses a Q-table to store these estimates, called Q-values, which guide the decision-making process. In each state s, the algorithm looks at the Q-values for all possible actions and chooses the action with the highest value. The Q-values are updated over time based on the rewards received and the outcomes of actions, using a learning rate \(\alpha\) and a discount factor \(\gamma\). This updating continues until the Q-values become optimal, meaning they accurately predict the best actions to take. The Q-values are updated as: \(Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R + \gamma {\max }_{a'} Q(s',a') - Q(s,a)\right]\). In this formula, R is the reward received after moving to the next state s’, and \({\max }_{a'} Q(s',a')\) is the highest Q-value for that next state.
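As a concrete illustration of this update rule, a tabular Q-learning step can be written as follows (illustrative only; it assumes discrete, integer-indexed states and actions):

    import numpy as np

    def q_learning_step(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                        alpha: float = 0.1, gamma: float = 0.99) -> None:
        """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] in place."""
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])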

DQN advances the Q-learning framework by utilizing a neural network to approximate the Q-value function, enabling it to generalize across a broader range of states and actions, especially in complex environments where the Q-table approach is impractical. In DQN, the neural network processes the state s as input and produces Q-values corresponding to every possible action a. The network is adjusted to minimize the gap between the estimated Q-values and the target Q-values. The target Q-value is derived from the reward R and the highest Q-value of the next state \(s'\), as articulated by the Bellman equation: \(Q(s,a)=R+\gamma {\max }_{a'}Q(s',a')\). To train the neural network, a loss function is employed, usually expressed as the mean squared error (MSE) between the estimated Q-values \(Q(s,a;\theta )\) and the target Q-values, where \(\theta\) signifies the network’s parameters. The target Q-values are computed as follows: \(y=R+\gamma {\max }_{a'}Q(s',a';\theta ^-)\).

To ensure stable training, DQN incorporates two fundamental techniques: experience replay and the use of a distinct target network. Experience replay involves capturing and storing the agent’s transitions \((s,a,R,s')\) in a replay buffer. The network is subsequently trained by randomly drawing mini-batches from this buffer, which disrupts the correlation between sequential samples and enhances learning effectiveness. The target network, a separate network created for calculating target Q-values, mirrors the Q-network but is updated less frequently, reducing the likelihood of oscillations and preventing divergence during training.
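A hedged PyTorch-style sketch of this target computation and loss is shown below; the tensor layout of the replay-buffer batch is our assumption, and the network itself is left abstract.

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.999):
        """MSE between Q(s,a;theta) and y = R + gamma * max_a' Q(s',a';theta^-)."""
        s, a, r, s_next, done = batch                              # tensors sampled from the replay buffer
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s,a;theta)
        with torch.no_grad():                                      # target network parameters are frozen here
            y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
        return F.mse_loss(q_sa, y)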

MADQN. MADQN extends the DQN framework to environments with multiple interacting agents, where each agent must learn optimal policies in the presence of other agents. MADQN leverages the principles of Q-learning and DQN, adapted to the multi-agent context, allowing agents to collaboratively or competitively learn their behaviors. In a multi-agent environment with n agents, state space S and action space A are represented as follows:

$$\begin{aligned} S=S_1\times S_2\times \cdots \times S_n \end{aligned}$$
(4)
$$\begin{aligned} A=A_1\times A_2\times \cdots \times A_n \end{aligned}$$
(5)

where \(S_i\) and \(A_i\) represent the state and action spaces of the i-th agent. The joint state s includes the states of all agents, and the joint action a includes the actions of all agents. Each agent i maintains its own Q-function \(Q_i(s,a)\), which estimates its expected return. The objective of each agent is to seek the optimal strategy \(\pi _i\) that maximizes its Q-value.

MADQN employs two key techniques to stabilize training: experience replay and a separate target network for each agent. Each agent i stores its experiences \((s,a,R_i,s')\) in a replay buffer. Mini-batches are randomly sampled from the buffer to train the Q-network, weakening the ties between successive experiences and bolstering overall stability. The loss function for each agent i can be expressed as:

$$\begin{aligned} L_i(\theta _i)=\mathbb {E}_{(s,a,R_i,s')\sim \mathcal {D}}\left[ \left( R_i+\gamma \underset{a'}{\max }\ Q_i(s',a';\theta _i^-)-Q_i(s,a;\theta _i)\right) ^2\right] \end{aligned}$$
(6)

where \(\theta _i\) represents the parameters of the Q-network for agent i and \(\theta _i^-\) represents the parameters of its target network. Moreover, each agent i keeps an independent target network to generate target Q-values. The parameters of this target network are periodically aligned with those of the Q-network, which helps in minimizing oscillations and averting divergence during the training phase.
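Equation (6) is essentially the DQN loss applied independently per agent on the joint state. A minimal sketch, assuming each agent object bundles its own q_net, target_net, and sampled batch, is:

    import torch
    import torch.nn.functional as F

    def madqn_losses(agents, batches, gamma=0.999):
        """Compute the per-agent loss L_i(theta_i) of Eq. (6) for every agent i."""
        losses = []
        for agent, (s, a_i, r_i, s_next) in zip(agents, batches):
            q = agent.q_net(s).gather(1, a_i.unsqueeze(1)).squeeze(1)          # Q_i(s, a_i; theta_i)
            with torch.no_grad():
                y = r_i + gamma * agent.target_net(s_next).max(dim=1).values   # target from theta_i^-
            losses.append(F.mse_loss(q, y))
        return losses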

MADQN is versatile enough to handle both collaborative and competitive scenarios. In collaborative settings, agents cooperate to enhance a common reward by aligning their actions to secure the most advantageous collective result. In competitive settings, agents aim to maximize their individual rewards, which may involve minimizing the rewards of other agents and learning strategies to outmaneuver their opponents. Through the application of experience replay and target networks, MADQN promotes consistent and efficient learning, making it a powerful approach for complex multi-agent systems.

The proposed GD-MA approach

In this section, we introduce the characteristics and advantages of our GD-MA approach. We then delve into the specifics of the state space, action space, reward mechanism, mathematical framework, and training methodology.

MADRL-based approach

The GD-MA method is primarily designed for optimizing real-time task scheduling in cloud-based DT systems. Its purpose is to ensure efficient coordination of tasks across multiple computing nodes, particularly in complex environments like power grids. By leveraging a multi-agent framework, GD-MA enables dynamic task allocation, optimizing resource utilization and ensuring more efficient scheduling processes. Unlike single-agent models, which often struggle with the diversity and concurrency of tasks, GD-MA significantly improves scalability by intelligently distributing tasks across the entire system. This approach is particularly beneficial for power grid management, where the complexity and variety of tasks demand a flexible and scalable solution. Furthermore, the use of MADRL allows GD-MA to adapt to the unique requirements of cloud-based DT systems, making it a more robust and efficient method compared to traditional approaches. Its adaptability ensures that GD-MA can effectively meet the demands of dynamic environments, offering improved performance across multiple tasks and workloads.

Action space

We consider a scenario with a fixed number of cloud computing nodes, each featuring a task queue that is managed by a scheduler. The scheduler is responsible for allocating tasks to the appropriate queues. Each agent makes the scheduling decision for one computing node, so its action indicates whether the current task joins the waiting queue of that node. Thus, for each agent i, the action space \(a_i\) is represented as \(a_i=\left\{ 0,1\right\}\). If \(a_i=1\), the task is queued in the waiting line of the i-th computing node. Conversely, if \(a_i=0\), the task is not queued at the i-th computing node.

State space

The state space encompasses both the state of the task and the state of each computational node. The state space of GD-MA at time t can be represented as \(S=S_{node} \cup S_{task}\). Specifically, \(S_{node}\) refers to the status of the computing nodes, while \(S_{task}\) represents the condition of the current task. The state space is specifically defined as follows:

$$\begin{aligned} S=\left\{ AT_t, T_t, D_1^t, D_2^t, \ldots , D_m^t\right\} \end{aligned}$$
(7)

In this scenario, \(AT_t\) refers to the upload time of the current task, while \(T_t\) specifies the task type. \(D_j^t\) reflects the time the current task would remain queued at the j-th node.
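As an illustration of Eq. (7), the state vector can be assembled from the current task and the per-node queue backlogs. This sketch reuses the Task and Node structures from the system model; the integer encoding of the task type is our own placeholder.

    import numpy as np

    TASK_TYPE_CODE = {"data_analysis": 0, "log": 1, "image": 2}   # illustrative encoding of K^T_i

    def build_state(task: Task, nodes: list) -> np.ndarray:
        """State of Eq. (7): task upload time, task type, and one queue-backlog entry per node."""
        backlogs = [max(n.idle_time - task.arrival_time, 0.0) for n in nodes]   # D_j^t for each node j
        return np.array([task.arrival_time, TASK_TYPE_CODE[task.task_type], *backlogs],
                        dtype=np.float32)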

Reward function

To fulfill QoS requirements while simultaneously minimizing response time and costs, the cost-efficiency per task and the average response time are pivotal elements in calculating the reward. As a result, the reward function for GD-MA is constructed as follows:

$$\begin{aligned} r=\left\{ \begin{array}{cc} -(\lambda _1 * T_i + \lambda _2 * cost_i) & if\ success = 1\\ -\lambda _3 & else \end{array}\right. \end{aligned}$$
(8)

Here, \(T_i\) denotes the response time. \(cost_i\) refers to the cost associated with the current task, where lower values lead to higher rewards. The factors \(\lambda _i\) act as trade-offs to balance the effects of cost and response time. Increasing \(\lambda _1\) shifts the focus toward minimizing response time, while a higher \(\lambda _2\) places greater emphasis on reducing cost. It is important to note that \(\lambda _1 + \lambda _2 = 1\), with \(\lambda _3\) remaining a constant penalty.
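A direct translation of Eq. (8) is shown below; the default weights and the penalty value \(\lambda _3\) are placeholder assumptions, not values reported in the paper.

    def gdma_reward(response_time: float, cost: float, success: int,
                    lam1: float = 0.5, lam2: float = 0.5, lam3: float = 10.0) -> float:
        """Reward of Eq. (8): weighted time/cost penalty on success, fixed penalty on QoS violation."""
        if success == 1:
            return -(lam1 * response_time + lam2 * cost)   # lam1 + lam2 = 1
        return -lam3                                       # placeholder QoS-violation penalty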

Model training period

Algorithm 1 presents an in-depth summary of the training methodology employed in the MADQN scheduling model. The process begins with the initialization of the Q-networks and target networks for each agent. Each Q-network is initialized with random weights, and its target network is initialized as a copy to provide stable Q-value targets during training. Each agent is assigned a replay buffer to store experiences, including the current state, action taken, reward received, and the next state. To achieve an optimal balance between exploration and exploitation, an exploration strategy, often \(\epsilon\)-greedy, is applied. Each agent randomly selects an action with a probability of \(\epsilon\), while with a probability of \(1-\epsilon\), it chooses the action that has the highest Q-value, allowing agents to discover new actions while leveraging known rewarding actions.

During training, each episode starts with initializing the state, representing the environment’s initial configuration. For each step within the episode, each agent determines its action in line with the policy it currently follows. Upon performing the joint action, the environment advances to a new state, with each agent earning a reward reflective of the action it executed. These interactions \((s,a,R,s')\) are recorded in the agents’ replay buffers. To update the Q-network, mini-batches of these recorded interactions are randomly selected from the replay buffer. Q-values are adjusted by minimizing a loss function that captures the gap between predicted and target Q-values, with target values derived through the Bellman equation. For stable learning, target networks are periodically aligned with the weights of Q-networks. This periodic alignment curbs oscillations and divergence, guiding the Q-values to converge toward the optimal action-value function over time. MADQN effectively handles both collaborative and competitive multi-agent environments, enabling agents to coordinate their actions for shared rewards or develop strategies to maximize individual rewards in competitive settings.

Algorithm 1 GD-MA task scheduling optimization algorithm
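Because Algorithm 1 is presented as a figure, the following sketch restates the training loop described above in Python. The environment interface, the agent helper methods, and the \(\epsilon\) schedule are assumptions made for illustration, not the authors' implementation.

    import copy
    import random

    def train_gd_ma(env, agents, episodes=1000, target_update=20,
                    batch_size=256, gamma=0.999, epsilon=0.1):
        """MADQN training loop: epsilon-greedy selection, experience replay, periodic target sync."""
        for episode in range(episodes):
            s = env.reset()                      # initial joint state
            done, step = False, 0
            while not done:
                # Each agent decides whether the current task joins its node's queue (a_i in {0, 1}).
                actions = [random.randrange(2) if random.random() < epsilon
                           else agent.best_action(s) for agent in agents]
                s_next, rewards, done = env.step(actions)
                for agent, a_i, r_i in zip(agents, actions, rewards):
                    agent.buffer.append((s, a_i, r_i, s_next))
                    if len(agent.buffer) >= batch_size:
                        batch = random.sample(agent.buffer, batch_size)
                        agent.update(batch, gamma)                      # gradient step on Eq. (6)
                if step % target_update == 0:
                    for agent in agents:
                        agent.target_net = copy.deepcopy(agent.q_net)   # align theta_i^- with theta_i
                s, step = s_next, step + 1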

Experimental evaluation

In this section, we provide an in-depth examination of the experimental analysis conducted on the proposed technique, comparing it with existing methods, including the state-of-the-art single-agent DQN and Proximal Policy Optimization (PPO) approaches for power grid digital twins [4].

Experimental setup

In the experiments, the MADQN method is set up with specific hyperparameters: training runs for 1,000 episodes, with the target network updated every 20 steps. The replay buffer is configured to hold 100,000 experiences, and the batch size is fixed at 256. A learning rate of 0.001 is applied, alongside a discount factor of 0.999. The method is evaluated against five scheduling strategies: Random, Round-Robin, Earliest, DQN, and PPO. The implementation leverages the PyTorch framework, utilizing a GPU for both training and inference.
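For reference, these reported settings can be collected into a single configuration object; the dictionary layout and key names below are ours, while the values are taken from the text.

    MADQN_HYPERPARAMS = {
        "episodes": 1000,               # training episodes
        "target_update_steps": 20,      # target-network sync interval
        "replay_buffer_size": 100_000,  # stored experiences
        "batch_size": 256,
        "learning_rate": 1e-3,
        "discount_factor": 0.999,
    }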

Similar to [4], this study involved configuring three distinct groups of cloud computing nodes, each tailored with specific attributes. Detailed information about these nodes is provided in Table 2. The nodes differ in terms of computational power and cost, reflecting their varied capabilities.

Table 2 Computing node types

Experimental results

In this part, we present a convergence experiment for MADQN together with experiments on task size, task arrival rate, number of computing nodes, and task ratio, with average response time and overall cost as the main metrics.

Convergence experiment

In the convergence experiment, we observe the changes in response time, total cost, and reward with an increasing number of episodes; the convergence curves are illustrated in Fig. 3. Response time drops sharply during the early episodes and then gradually stabilizes, indicating improved efficiency. The total cost also shows a sharp decline initially, followed by a more gradual reduction and eventual stabilization, reflecting optimized resource usage over time. The reward increases significantly during the early episodes and then plateaus, demonstrating the effectiveness of the learning algorithm in achieving higher rewards as it converges. These results indicate successful convergence, with the system becoming more efficient, cost-effective, and rewarding over the course of the training episodes.

Fig. 3 Convergence experiment

Varying task size

We analyzed the capability of several approaches to handle a variety of average task sizes. Task sizes were varied between 200 and 1000, with the workload distribution comprising 50% data analysis, 30% log management, and 20% image processing. Figure 4 presents the results, showing how each method performed regarding average response time and cumulative cost across a range of task sizes. As shown in the figure, the MADQN method achieves the lowest average response time across different task sizes, followed by the PPO and DQN methods. Conversely, the Random method exhibits the highest response time, especially as task size increases. The figure also illustrates overall cost, where the MADQN, DQN, and PPO methods again perform better, maintaining lower costs compared to other methods. The performance of the Round-Robin and Earliest methods is intermediate, with costs and response times higher than those of MADQN, DQN, and PPO but lower than the Random method. These findings highlight the efficiency of the MADQN approach in managing tasks of varying sizes, providing significant improvements in both response time and cost.

Fig. 4 Performance under different task sizes

Varying task arrival rate

We examined the effectiveness of various approaches under fluctuating arrival rates, conducting experiments with average data arrival rates set between 10 and 30. As shown in Fig. 5, the results reveal that as the arrival rate increases, each method exhibits unique patterns. Specifically, the MADQN method consistently outperforms the others with respect to average response time across all arrival rates, achieving the lowest values. Conversely, the Random method shows the highest average response times. For overall cost, the MADQN method also demonstrates superior performance, maintaining the lowest cost values as the arrival rate escalates, while the Random method incurs the highest costs, illustrating its inefficiency in handling increasing workload pressures.

Fig. 5 Performance under different arrival rates

Varying computing nodes number

This experiment evaluated the effectiveness of various approaches in managing tasks across differing configurations of nodes. The number of nodes is varied from 3 to 24. Figure 6 highlights that as the number of nodes increases, each method displays unique trends in both average response time and overall cost. Specifically, with a smaller number of nodes, the MADQN method demonstrates significantly lower average response times compared to the other methods. As the number of nodes increases to 12 and 24, all methods show a reduction in average response time, but MADQN continues to outperform the others, maintaining the lowest values. In terms of overall cost, MADQN consistently achieves lower costs across all node configurations, particularly excelling at higher node counts. The Random method, however, incurs the highest costs, particularly with fewer nodes. These results underscore the robustness and efficiency of the MADQN method in optimizing both response time and cost across varying numbers of computing nodes, outperforming traditional methods.

Fig. 6 Performance under different numbers of nodes

Varying ratio of tasks

In this experiment, we assessed the performance of various methods under different task distributions, with the task ratios for image processing, log processing, and data analysis set at 2:1:1, 1:2:1, and 1:1:2, respectively. According to the results shown in Fig. 7, the GD-MA algorithm performs well in scheduling tasks under all three ratios, maintaining a low average response time and cost compared to other methods. Specifically, in terms of average response time, the different methods show relatively small variations across task ratios, indicating that our GD-MA method is able to maintain the efficiency and consistency of task allocation even under varying workload conditions. In terms of overall cost, GD-MA similarly compares favorably to other methods (including the advanced single-agent methods DQN and PPO) at different task ratios, demonstrating its ability to optimize resource utilization and reduce operational expenses under varying task loads.

Fig. 7 Performance under different task ratios

Conclusion

In this paper, we introduced GD-MA, an advanced task scheduling framework tailored for optimizing grid operations within a cloud computing environment through digital twin technology. By leveraging Multi-Agent Deep Q-Network (MADQN) principles, GD-MA effectively addresses the challenges of real-time task allocation, focusing on minimizing response times and reducing operational costs at computing nodes. We presented the detailed design and implementation of our method, and our comprehensive analysis and experimental results show that GD-MA significantly outperforms existing methods, including the state-of-the-art single-agent approaches. As future work, we plan to extend GD-MA to more complex cloud-edge environments and address task privacy issues during processing in cloud settings.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

  1. Omitaomu OA, Niu H (2021) Artificial intelligence techniques in smart grid: A survey. Smart Cities 4(2):548–568

  2. Jiang Z, Lv H, Li Y, Guo Y (2022) A novel application architecture of digital twin in smart grid. J Ambient Intell Humanized Comput 13(8):3819–3835

  3. Sandhu AK (2021) Big data with cloud computing: Discussions and challenges. Big Data Min Analytics 5(1):32–40

  4. Qi D, Xi X, Tang Y, Zheng Y, Guo Z (2024) Real-time scheduling of power grid digital twin tasks in cloud via deep reinforcement learning. J Cloud Comput 13(1):121

  5. Abualigah L, Diabat A (2021) A novel hybrid antlion optimization algorithm for multi-objective task scheduling problems in cloud computing environments. Clust Comput 24(1):205–223

  6. Liu Q, Xia T, Cheng L, Van Eijk M, Ozcelebi T, Mao Y (2022) Deep reinforcement learning for load-balancing aware network control in iot edge systems. IEEE Trans Parallel Distrib Syst 33(6):1491–1502

  7. Cheng L, Wang Y, Cheng F, Liu C, Zhao Z, Wang Y (2023) A deep reinforcement learning-based preemptive approach for cost-aware cloud job scheduling. IEEE Trans Sustain Comput 9(3):422–432

  8. Gu Y, Cheng F, Yang L, Xu J, Chen X, Cheng L (2024) Cost-aware cloud workflow scheduling using drl and simulated annealing. Digit Commun Netw. https://doi.org/10.1016/j.dcan.2023.12.009

  9. Chen Y, Gu W, Xu J, Zhang Y, Min G (2023) Dynamic task offloading for digital twin-empowered mobile edge computing via deep reinforcement learning. China Commun 20(11):164–175

  10. Ji Z, Wu S, Jiang C (2023) Cooperative multi-agent deep reinforcement learning for computation offloading in digital twin satellite edge networks. IEEE J Sel Areas Commun 41(11):3414–3429

  11. Javaid M, Haleem A, Suman R (2023) Digital twin applications toward industry 4.0: A review. Cogn Robot 3:71–92

  12. Tan X, Wang M, Wang T, Zheng Q, Wu J, Yang J (2024) Adaptive task scheduling in digital twin empowered cloud-native vehicular networks. IEEE Trans Veh Technol

  13. Park KT, Son YH, Ko SW, Noh SD (2021) Digital twin and reinforcement learning-based resilient production control for micro smart factory. Appl Sci 11(7):2977

  14. Jeremiah SR, Yang LT, Park JH (2024) Digital twin-assisted resource allocation framework based on edge collaboration for vehicular edge computing. Futur Gener Comput Syst 150:243–254

  15. Liao H, Zhou Z, Liu N, Zhang Y, Xu G, Wang Z, Mumtaz S (2022) Cloud-edge-device collaborative reliable and communication-efficient digital twin for low-carbon electrical equipment management. IEEE Trans Ind Inform 19(2):1715–1724

  16. Li W, Rentemeister M, Badeda J, Jöst D, Schulte D, Sauer DU (2020) Digital twin for battery systems: Cloud battery management system with online state-of-charge and state-of-health estimation. J Energy Storage 30:101557

  17. Khan SA, Rehman HZU, Waqar A, Khan ZH, Hussain M, Masud U (2023) Digital twin for advanced automation of future smart grid. In: 2023 1st International Conference on Advanced Innovations in Smart Cities. IEEE, Jeddah, pp 1–6. https://doi.org/10.1109/ICAISC56366.2023.10085428

  18. Shukri SE, Al-Sayyed R, Hudaib A, Mirjalili S (2021) Enhanced multi-verse optimizer for task scheduling in cloud computing environments. Expert Syst Appl 168:114230

  19. Abd Elaziz M, Attiya I (2021) An improved henry gas solubility optimization algorithm for task scheduling in cloud computing. Artif Intell Rev 54(5):3599–3637

  20. Zade BMH, Mansouri N, Javidi MM (2021) SAEA: A security-aware and energy-aware task scheduling strategy by Parallel Squirrel Search Algorithm in cloud environment. Expert Syst Appl 176:114915

  21. Xu X, Shen B, Ding S, Srivastava G, Bilal M, Khosravi MR, Menon VG, Jan MA, Wang M (2020) Service offloading with deep q-network for digital twinning-empowered internet of vehicles in edge computing. IEEE Trans Ind Inform 18(2):1414–1423

  22. Zhang Y, Hu J, Min G (2023) Digital twin-driven intelligent task offloading for collaborative mobile edge computing. IEEE J Sel Areas Commun 41(10):3034–3045. https://doi.org/10.1109/JSAC.2023.3310058

  23. Zhu L, Tan L (2024) Task offloading scheme of vehicular cloud edge computing based on digital twin and improved a3c. Internet Things 26:101192

  24. Zhou Z, Jia Z, Liao H, Lu W, Mumtaz S, Guizani M, Tariq M (2021) Secure and latency-aware digital twin assisted resource scheduling for 5g edge computing-empowered distribution grids. IEEE Trans Ind Inform 18(7):4933–4943

  25. Cho C, Shin S, Jeon H, Yoon S (2020) Qos-aware workload distribution in hierarchical edge clouds: A reinforcement learning approach. IEEE Access 8:193297–193313

  26. Chen X, Yu Q, Dai S, Sun P, Tang H, Cheng L (2023) Deep reinforcement learning for efficient iot data compression in smart railroad management. IEEE Internet Things J 11(15):25494–25504

  27. Chu T, Wang J, Codecà L, Li Z (2019) Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans Intell Transp Syst 21(3):1086–1095

  28. Xiong J, Guo P, Wang Y, Meng X, Zhang J, Qian L, Yu Z (2023) Multi-agent deep reinforcement learning for task offloading in group distributed manufacturing systems. Eng Appl Artif Intell 118:105710

  29. Betalo ML, Leng S, Abishu HN, Seid AM, Fakirah M, Erbad A, Guizani M (2024) Multi-agent drl-based energy harvesting for freshness of data in uav-assisted wireless sensor networks. IEEE Trans Netw Serv Manag. https://doi.org/10.1109/TNSM.2024.3454217

Acknowledgements

We express our gratitude to the authors of [4] for providing the basic materials and offering valuable insights that contributed to our work.

Funding

This work was funded by Energy Development Research Institute, China Southern Power Grid under the project of Research on the Depth and Workload Assessment of Typical Modeling Content for Power Grid Digital Twins (Main Grid 110kV and Above) with grant No. EDRI-PS-ZLYJ-2023-103.

Author information

Contributions

Luyao Pei: Conceptualization, Writing - original draft. Cheng Xu: Conceptualization, Methodology, Writing - review & editing. Xueli Yin: Methodology, Writing- review & editing. Jinsong Zhang: Methodology, Writing - review & editing.

Corresponding author

Correspondence to Cheng Xu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Pei, L., Xu, C., Yin, X. et al. Multi-agent Deep Reinforcement Learning for cloud-based digital twins in power grid management. J Cloud Comp 13, 152 (2024). https://doi.org/10.1186/s13677-024-00713-w

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13677-024-00713-w

Keywords