
CN115169957A - Power distribution network scheduling method, device and medium based on deep reinforcement learning - Google Patents

Power distribution network scheduling method, device and medium based on deep reinforcement learning Download PDF

Info

Publication number
CN115169957A
CN115169957A
Authority
CN
China
Prior art keywords
distribution network
power
power distribution
scheduled
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210893449.XA
Other languages
Chinese (zh)
Inventor
陈铭
刘刚刚
侯凯
马顺
阮楠千
许银亮
梅诗妍
曾瑜
胡晋岚
孙罡
姜玉梁
周妍
秦燕
秦万祥
赵芳菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202210893449.XA priority Critical patent/CN115169957A/en
Publication of CN115169957A publication Critical patent/CN115169957A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/04Circuit arrangements for ac mains or ac distribution networks for connecting networks of the same frequency but supplied from different sources
    • H02J3/06Controlling transfer of power between connected networks; Controlling sharing of load between connected networks
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20Simulating, e g planning, reliability check, modelling or computer assisted design [CAD]
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/10The dispersed energy generation being of fossil origin, e.g. diesel generators
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Power Engineering (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Primary Health Care (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Water Supply & Treatment (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Public Health (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides a power distribution network scheduling method, device and medium based on deep reinforcement learning. The method comprises: constructing, for the power distribution network to be scheduled, operation constraints and cost functions for each of a plurality of devices, constraints and a cost function for electric energy trading between the power distribution network to be scheduled and the main grid, and risk constraints on the node voltages and branch powers of the power distribution network, to obtain a scheduling model of the power distribution network to be scheduled; acquiring state variables, action variables and a reward function and constructing a Markov decision process; training the policy network corresponding to the Markov decision process through the SAC algorithm in combination with basic data; and scheduling the power distribution network to be scheduled based on the output of the trained policy network. Compared with the prior art, by constructing a Markov decision process, the policy network trained through the SAC algorithm adapts to online operation and complex computation, achieves millisecond-level fast calculation, and significantly improves generalization capability.

Description

Power distribution network scheduling method, device and medium based on deep reinforcement learning
Technical Field
The invention relates to the field of power systems, in particular to a power distribution network scheduling method, device and medium based on deep reinforcement learning.
Background
Risk assessment of a power system is a static security analysis method that combines the probability and the severity of operating states and can quantitatively reflect the operational safety of the system. However, risk-based power distribution network scheduling in the prior art does not account for the uncertainty of renewable generation and load; moreover, because the computations involved in scheduling are highly non-convex and hard to express explicitly, conventional methods are difficult to solve and the resulting solutions generalize poorly.
Disclosure of Invention
The invention provides a power distribution network scheduling method, device and medium based on deep reinforcement learning, and aims to solve the technical problem of poor generalization capability in the prior art.
In order to solve the technical problem, an embodiment of the present invention provides a power distribution network scheduling method based on deep reinforcement learning, including:
constructing, for a power distribution network to be scheduled, operation constraints and cost functions for each of a plurality of devices, constructing constraints and a cost function for electric energy trading between the power distribution network to be scheduled and the main grid, and constructing risk constraints on the node voltages and branch powers of the power distribution network to be scheduled, to obtain a scheduling model (specifically, an economic dispatch model) of the power distribution network to be scheduled;
acquiring a state variable, an action variable and a reward function of the scheduling model, and constructing a Markov decision process for the scheduling model based on the state variable, the action variable and the reward function;
training a strategy network corresponding to the Markov decision process by a SAC algorithm in combination with basic data in the Markov decision process;
and scheduling the power distribution network to be scheduled based on the output of the trained strategy network.
As a preferred scheme, the training of the policy network corresponding to the Markov decision process by the SAC algorithm specifically includes:
updating the parameters of the SAC algorithm through the ASAM algorithm and the PER algorithm, and training the agent and the policy network corresponding to the Markov decision process through the updated SAC algorithm; wherein the parameters of the SAC algorithm comprise the soft Q network parameters, the temperature coefficient and the network parameters of the policy network.
Preferably, the plurality of devices comprises at least one diesel unit and at least one energy storage system;
the operation constraint of the diesel units is:
P^{G,min}_i ≤ P^G_{i,t} ≤ P^{G,max}_i,  ∀ i ∈ N_G, ∀ t ∈ T
where P^G_{i,t} is the active power output of the i-th diesel unit of the power distribution network to be scheduled during period t, P^{G,min}_i and P^{G,max}_i are the minimum and maximum active power of the i-th diesel unit, N_G is the set of all nodes connected to a diesel unit, and T is the set of all periods in the scheduling cycle;
the cost functions of the diesel units are:
C^{G,fuel}_t = Σ_{i∈N_G} [ a_{G,i}·(P^G_{i,t})² + b_{G,i}·P^G_{i,t} + c_{G,i} ]
C^{G,carbon}_t = Σ_{i∈N_G} [ d_{G,i}·(P^G_{i,t})² + e_{G,i}·P^G_{i,t} ]
where C^{G,fuel}_t is the sum of the fuel costs of all diesel units of the power distribution network to be scheduled during period t, C^{G,carbon}_t is the sum of the carbon emission costs of all diesel units during period t, a_{G,i}, b_{G,i} and c_{G,i} are the fuel cost coefficients of the i-th diesel unit, and d_{G,i} and e_{G,i} are the carbon emission cost coefficients of the i-th diesel unit.
Preferably, the operation constraints of the energy storage system are:
-P^{ES,C,max}_i ≤ P^{ES}_{i,t} ≤ P^{ES,D,max}_i,  ∀ i ∈ N_E, ∀ t ∈ T
SOC^{min}_{i,t} ≤ SOC_{i,t} ≤ SOC^{max}_{i,t}
SOC_{i,t+1} = SOC_{i,t} - η_C·P^{ES}_{i,t}/E_i  (charging, P^{ES}_{i,t} < 0);  SOC_{i,t+1} = SOC_{i,t} - P^{ES}_{i,t}/(η_D·E_i)  (discharging, P^{ES}_{i,t} ≥ 0)
where P^{ES}_{i,t} is the active power output of the i-th energy storage system of the power distribution network to be scheduled during period t, P^{ES,C,max}_i is the maximum charging power of the i-th energy storage system, P^{ES,D,max}_i is its maximum discharging power, N_E is the set of all nodes of the power distribution network connected to an energy storage system, SOC_{i,t} is the state of charge of the i-th energy storage system during period t, SOC^{min}_{i,t} and SOC^{max}_{i,t} are the minimum and maximum states of charge allowed for the i-th energy storage system during period t, η_C and η_D are the charging and discharging efficiencies of the energy storage system, and E_i is the capacity of the i-th energy storage system;
the cost function of the energy storage system is:
C^{ES}_t = Σ_{i∈N_E} a_{E,i}·|P^{ES}_{i,t}|
where C^{ES}_t is the sum of the charging and discharging costs of all energy storage systems of the power distribution network to be scheduled during period t, and a_{E,i} is the cost coefficient of the i-th energy storage system.
As a preferred scheme, the cost function of electric energy trading between the power distribution network to be scheduled and the main grid is:
C^M_t = (1 + β_M)·α_{M,t}·P^M_t,  if P^M_t ≥ 0;   C^M_t = (1 - β_M)·α_{M,t}·P^M_t,  if P^M_t < 0
where C^M_t is the cost of the power distribution network to be scheduled for purchasing electricity from the main grid during period t, P^M_t > 0 is the power purchased by the power distribution network from the main grid during period t, P^M_t < 0 is the power sold by the power distribution network to the main grid during period t, α_{M,t} is the real-time electricity price during period t, and β_M is the proportional difference between the purchase/sale prices and the real-time price;
the constraints of electric energy trading between the power distribution network to be scheduled and the main grid are:
S^M_t = √( (P^M_t)² + (Q^M_t)² )
S^{M,min} ≤ S^M_t ≤ S^{M,max}
where Q^M_t is the reactive power flowing from the main grid to the power distribution network to be scheduled during period t, S^M_t is the apparent power flowing from the main grid to the power distribution network during period t, and S^{M,min} and S^{M,max} are the minimum and maximum capacity of the tie line.
As a preferred scheme, constructing the risk constraints of the node voltages and branch powers of the power distribution network to be scheduled includes:
constructing an internal power flow calculation model of the power distribution network to be scheduled, in a linearized branch flow form:
P_{i,t} = P^{PV}_{i,t} + P^{WT}_{i,t} + P^G_{i,t} + P^{ES}_{i,t} - P^L_{i,t} + P^M_t·1(i ∈ N_0)
Q_{i,t} = Q^G_{i,t} + Q^{WT}_{i,t} - Q^L_{i,t} + Q^M_t·1(i ∈ N_0)
P_{ij,t} = Σ_k P_{jk,t} - P_{j,t}   (the sum runs over all branches jk leaving node j)
Q_{ij,t} = Σ_k Q_{jk,t} - Q_{j,t}
S_{ij,t} = √( P_{ij,t}² + Q_{ij,t}² )
where P_{i,t} is the net injected active power of node i during period t, Q_{i,t} is the net injected reactive power of node i during period t, P_{ij,t} is the active power flowing on branch ij during period t, Q_{ij,t} is the reactive power flowing on branch ij during period t, N is the set of all nodes of the power distribution network to be scheduled, B_{ij} characterizes the power flow on branch ij, S_{ij,t} is the apparent power flowing on branch ij during period t, P^{PV}_{i,t} is the active power of photovoltaic generation at node i during period t, P^{WT}_{i,t} is the active power of wind generation at node i during period t, P^L_{i,t} is the active power of the load at node i during period t, Q^G_{i,t} is the reactive power generated by the diesel unit at node i during period t, Q^{WT}_{i,t} is the reactive power of wind generation at node i during period t, Q^L_{i,t} is the reactive power of the load at node i during period t, N_0 is the set of nodes through which the power distribution network is connected to the main grid, and 1(i ∈ N_0) equals 1 if node i belongs to N_0 and 0 otherwise;
constructing the risk constraints of the node voltage amplitudes and branch apparent powers of the power distribution network to be scheduled on the basis of the internal power flow calculation model, specifically:
R^V_t = Σ_{i∈N} w_i·R^V_{i,t} ≤ ε_V
R^S_t = Σ_{ij} w_{ij}·R^S_{ij,t} ≤ ε_S
where R^V_t is the node voltage amplitude risk of the power distribution network to be scheduled during period t, R^S_t is the branch apparent power risk of the power distribution network to be scheduled during period t, ε_V is the node voltage amplitude risk threshold of the power distribution network to be scheduled, ε_S is the branch apparent power risk threshold of the power distribution network to be scheduled, w_i is the weight of node i, and w_{ij} is the weight of branch ij.
As a preferred scheme, the obtaining of the state variables, action variables and reward function of the scheduling model specifically includes:
defining the state variable s_t of the scheduling model during period t and the action variable a_t during period t:
s_t = { P^{WT}_{i,t}, P^{PV}_{i,t}, P^L_{i,t}, α_{M,t}, SOC_{i,t} }
a_t = { P^G_{i,t}, P^{ES}_{i,t} }
where P^{WT}_{i,t} is the active power of wind generation at node i during period t, P^{PV}_{i,t} is the active power of photovoltaic generation at node i during period t, P^L_{i,t} is the load of node i during period t, α_{M,t} is the real-time electricity price, SOC_{i,t} is the state of charge of the energy storage system at node i during period t, P^G_{i,t} is the active power of the diesel unit at node i during period t, and P^{ES}_{i,t} is the charging/discharging power of the energy storage system at node i during period t;
defining the reward function of the scheduling model:
r(s_t, a_t) = -( R^C_t + R^P_t )
R^C_t = ω_1·( C^{G,fuel}_t + C^{G,carbon}_t + C^{ES}_t + C^M_t )
R^P_t = ω_2·P^{SOC}_t + ω_3·P^V_t + ω_4·P^S_t
where r(s_t, a_t) is the reward obtained by the agent for taking action a_t in state s_t, R^C_t is the weighted total operating cost during period t, R^P_t is the weighted total penalty during period t (P^{SOC}_t, P^V_t and P^S_t being the penalties for violating the state-of-charge constraint, the node voltage amplitude risk threshold and the branch apparent power risk threshold, respectively), and ω_1, ω_2, ω_3 and ω_4 are the weights of the reward components.
As a preferred scheme, the constructing of a Markov decision process for the scheduling model based on the state variables, action variables and reward function specifically includes:
constructing the Markov decision process according to the following equations:
M = ( S, A, P, r )
s_{t+1} ~ P( · | s_t, a_t )
where S is the state space, A is the action space, P is the state transition probability function, and r is the reward function.
Correspondingly, the embodiment of the invention also provides a power distribution network scheduling device based on deep reinforcement learning, which comprises the following components:
a constraint module, configured to construct, for a power distribution network to be scheduled, operation constraints and cost functions for each of a plurality of devices, construct constraints and a cost function for electric energy trading between the power distribution network to be scheduled and the main grid, and construct risk constraints on the node voltages and branch powers of the power distribution network to be scheduled, to obtain a scheduling model of the power distribution network to be scheduled;
the Markov decision process building module is used for obtaining a state variable, an action variable and a reward function of the scheduling model and building a Markov decision process for the scheduling model based on the state variable, the action variable and the reward function;
the training module is used for training a strategy network corresponding to the Markov decision process through a SAC algorithm in combination with basic data in the Markov decision process;
and the scheduling module is used for scheduling the power distribution network to be scheduled based on the output of the trained strategy network.
Correspondingly, the embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium comprises a stored computer program; when the computer program runs, the equipment where the computer readable storage medium is located is controlled to execute the power distribution network scheduling method based on deep reinforcement learning.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
The embodiments of the invention provide a power distribution network scheduling method and device based on deep reinforcement learning and a computer-readable storage medium, the method comprising: constructing, for a power distribution network to be scheduled, operation constraints and cost functions for each of a plurality of devices, constraints and a cost function for electric energy trading between the power distribution network to be scheduled and the main grid, and risk constraints on the node voltages and branch powers of the power distribution network to be scheduled, to obtain a scheduling model of the power distribution network to be scheduled; acquiring the state variables, action variables and reward function of the scheduling model, and constructing a Markov decision process for the scheduling model based on them; training the policy network corresponding to the Markov decision process through the SAC algorithm in combination with basic data; and scheduling the power distribution network to be scheduled based on the output of the trained policy network. Compared with the prior art, by constructing the Markov decision process, the policy network trained through the SAC algorithm adapts to online operation and complex computation, achieves millisecond-level fast calculation, and significantly improves generalization capability.
Drawings
FIG. 1 is a schematic flow diagram of an embodiment of the power distribution network scheduling method based on deep reinforcement learning provided by the invention.
FIG. 2 is a schematic diagram of the state of charge of an embodiment of the power distribution network energy storage system provided by the invention.
FIG. 3 is a schematic diagram of the training process of an embodiment of the policy network provided by the invention.
FIG. 4 is a schematic structural diagram of an embodiment of the power distribution network scheduling device based on deep reinforcement learning provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment is as follows:
Referring to FIG. 1, FIG. 1 shows a power distribution network scheduling method based on deep reinforcement learning according to an embodiment of the present invention, comprising steps S1 to S4:
Step S1, constructing, for a power distribution network to be scheduled, operation constraints and cost functions for each of a plurality of devices, constructing constraints and a cost function for electric energy trading between the power distribution network to be scheduled and the main grid, and constructing risk constraints on the node voltages and branch powers of the power distribution network to be scheduled, to obtain a scheduling model (specifically, an economic dispatch model) of the power distribution network to be scheduled.
In this embodiment, the plurality of devices includes at least one diesel unit and at least one energy storage system;
the operation constraint of the diesel units is:
P^{G,min}_i ≤ P^G_{i,t} ≤ P^{G,max}_i,  ∀ i ∈ N_G, ∀ t ∈ T
where P^G_{i,t} is the active power output of the i-th diesel unit (i.e. the diesel unit at the i-th node) of the power distribution network to be scheduled during period t, P^{G,min}_i and P^{G,max}_i are the minimum and maximum active power of the i-th diesel unit, N_G is the set of all nodes connected to a diesel unit, and T is the set of all periods in the scheduling cycle.
The cost functions of the diesel units are:
C^{G,fuel}_t = Σ_{i∈N_G} [ a_{G,i}·(P^G_{i,t})² + b_{G,i}·P^G_{i,t} + c_{G,i} ]
C^{G,carbon}_t = Σ_{i∈N_G} [ d_{G,i}·(P^G_{i,t})² + e_{G,i}·P^G_{i,t} ]
where C^{G,fuel}_t is the sum of the fuel costs of all diesel units of the power distribution network to be scheduled during period t, C^{G,carbon}_t is the sum of the carbon emission costs of all diesel units during period t, a_{G,i}, b_{G,i} and c_{G,i} are the fuel cost coefficients of the i-th diesel unit, and d_{G,i} and e_{G,i} are the carbon emission cost coefficients of the i-th diesel unit.
The operation constraints of the energy storage system are:
-P^{ES,C,max}_i ≤ P^{ES}_{i,t} ≤ P^{ES,D,max}_i,  ∀ i ∈ N_E, ∀ t ∈ T
SOC^{min}_{i,t} ≤ SOC_{i,t} ≤ SOC^{max}_{i,t}
SOC_{i,t+1} = SOC_{i,t} - η_C·P^{ES}_{i,t}/E_i  (charging, P^{ES}_{i,t} < 0);  SOC_{i,t+1} = SOC_{i,t} - P^{ES}_{i,t}/(η_D·E_i)  (discharging, P^{ES}_{i,t} ≥ 0)
where P^{ES}_{i,t} is the active power output of the i-th energy storage system (i.e. the energy storage system at the i-th node) of the power distribution network to be scheduled during period t (P^{ES}_{i,t} > 0 indicates discharging and P^{ES}_{i,t} < 0 indicates charging), P^{ES,C,max}_i is the maximum charging power of the i-th energy storage system, P^{ES,D,max}_i is its maximum discharging power, and both are greater than 0; N_E is the set of all nodes of the power distribution network connected to an energy storage system, SOC_{i,t} is the state of charge of the i-th energy storage system during period t, SOC^{min}_{i,t} and SOC^{max}_{i,t} are the minimum and maximum states of charge allowed for the i-th energy storage system during period t, η_C and η_D are the charging and discharging efficiencies of the energy storage system (η_C, η_D ∈ [0,1]), and E_i is the capacity of the i-th energy storage system.
In these constraints, the first constraint characterizes the capacity limit of the converter to which the energy storage system is connected. The second constraint avoids overcharging and over-discharging, which would shorten the life of the energy storage system. The third constraint characterizes the relationship between the state of charge in the next period and the state of charge and charging/discharging power in the current period. To facilitate scheduling in the next cycle, the SOC at the last period of each scheduling cycle is returned to its initial value, i.e. SOC_{i,0} = SOC_{i,T}.
Moreover, the bounds SOC^{min}_{i,t} and SOC^{max}_{i,t} vary with the current time t instead of remaining constant, as shown in FIG. 2: the slopes of segments A-B and E-D are determined by the maximum charging power of the i-th energy storage system, and the slopes of segments C-D and A-F are determined by its maximum discharging power.
The cost function of the energy storage system is:
C^{ES}_t = Σ_{i∈N_E} a_{E,i}·|P^{ES}_{i,t}|
where C^{ES}_t is the sum of the charging and discharging costs of all energy storage systems of the power distribution network to be scheduled during period t, and a_{E,i} is the cost coefficient of the i-th energy storage system.
Further, the cost function of electric energy trading between the power distribution network to be scheduled and the main grid is:
C^M_t = (1 + β_M)·α_{M,t}·P^M_t,  if P^M_t ≥ 0;   C^M_t = (1 - β_M)·α_{M,t}·P^M_t,  if P^M_t < 0
where C^M_t is the cost of the power distribution network to be scheduled for purchasing electricity from the main grid during period t, P^M_t > 0 is the power purchased by the power distribution network from the main grid during period t, P^M_t < 0 is the power sold by the power distribution network to the main grid during period t, α_{M,t} is the real-time electricity price during period t, and β_M is the proportional difference between the purchase/sale prices and the real-time price. Its purpose is to make the price at which the power distribution network purchases electricity higher than the price at which it sells electricity, so as to promote consumption of power inside the distribution network and reduce the negative impact of internal disturbances of the distribution network on the main grid.
The constraints of electric energy trading between the power distribution network to be scheduled and the main grid are:
S^M_t = √( (P^M_t)² + (Q^M_t)² )
S^{M,min} ≤ S^M_t ≤ S^{M,max}
where Q^M_t is the reactive power flowing from the main grid to the power distribution network to be scheduled during period t, S^M_t is the apparent power flowing from the main grid to the power distribution network during period t, and S^{M,min} and S^{M,max} are the minimum and maximum capacity of the tie line.
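As a hedged sketch of the trading cost above (the value of β_M and the function name are assumptions), the piecewise purchase/sale pricing may be evaluated as follows.

```python
def grid_trade_cost(p_m, price_rt, beta=0.1):
    """Cost of trading with the main grid for one period.

    p_m > 0: power bought from the main grid, p_m < 0: power sold.
    The purchase price is assumed to be (1 + beta) * price_rt and the
    selling price (1 - beta) * price_rt, so buying is dearer than selling.
    """
    if p_m >= 0:
        return (1.0 + beta) * price_rt * p_m
    return (1.0 - beta) * price_rt * p_m       # negative value: revenue from selling

print(grid_trade_cost(0.5, price_rt=0.8))      # buying 0.5 -> positive cost
print(grid_trade_cost(-0.5, price_rt=0.8))     # selling 0.5 -> negative cost
```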
Constructing the risk constraints of the node voltages and branch powers of the power distribution network to be scheduled includes:
constructing an internal power flow calculation model of the power distribution network to be scheduled, in a linearized branch flow form:
P_{i,t} = P^{PV}_{i,t} + P^{WT}_{i,t} + P^G_{i,t} + P^{ES}_{i,t} - P^L_{i,t} + P^M_t·1(i ∈ N_0)
Q_{i,t} = Q^G_{i,t} + Q^{WT}_{i,t} - Q^L_{i,t} + Q^M_t·1(i ∈ N_0)
P_{ij,t} = Σ_k P_{jk,t} - P_{j,t}   (the sum runs over all branches jk leaving node j)
Q_{ij,t} = Σ_k Q_{jk,t} - Q_{j,t}
S_{ij,t} = √( P_{ij,t}² + Q_{ij,t}² )
where P_{i,t} is the net injected active power of node i during period t, Q_{i,t} is the net injected reactive power of node i during period t, P_{ij,t} is the active power flowing on branch ij (branch ij being the branch from node i to node j) during period t, Q_{ij,t} is the reactive power flowing on branch ij during period t, N is the set of all nodes of the power distribution network to be scheduled, B_{ij} characterizes the power flow on branch ij, S_{ij,t} is the apparent power flowing on branch ij during period t, P^{PV}_{i,t} is the active power of photovoltaic generation at node i during period t, P^{WT}_{i,t} is the active power of wind generation at node i during period t, P^L_{i,t} is the active power of the load at node i during period t, Q^G_{i,t} is the reactive power generated by the diesel unit at node i during period t, Q^{WT}_{i,t} is the reactive power of wind generation at node i during period t, Q^L_{i,t} is the reactive power of the load at node i during period t, N_0 is the set of nodes through which the power distribution network is connected to the main grid, and 1(i ∈ N_0) equals 1 if node i belongs to N_0 and 0 otherwise. If node i is not connected to the corresponding device, the corresponding term (e.g. P^{PV}_{i,t}, P^{WT}_{i,t}, P^L_{i,t}, Q^G_{i,t}, Q^{WT}_{i,t} or Q^L_{i,t}) is 0.
The node voltage calculation formula is:
V_{j,t} = V_{i,t} - ( r_{ij}·P_{ij,t} + x_{ij}·Q_{ij,t} ) / V_0
where V_{j,t} is the voltage amplitude of node j during period t, V_{i,t} is the voltage amplitude of node i during period t, r_{ij} and x_{ij} are the resistance and reactance of branch ij, respectively, and V_0 is the node voltage at the connection with the main grid, which is a preset value.
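A minimal sketch of the linearized voltage-drop calculation above is given below, assuming a radial feeder; the node numbering and all branch data are hypothetical.

```python
def feeder_voltages(v0, branches, p_flow, q_flow, r, x):
    """Voltage magnitudes along a radial feeder using the linearized drop
    V_j = V_i - (r_ij * P_ij + x_ij * Q_ij) / V_0.

    branches is a list of (i, j) pairs ordered from the root node 0;
    p_flow, q_flow, r and x are dicts keyed by (i, j).
    """
    v = {0: v0}                                   # node 0 is the main-grid bus
    for (i, j) in branches:
        v[j] = v[i] - (r[(i, j)] * p_flow[(i, j)] + x[(i, j)] * q_flow[(i, j)]) / v0
    return v

branches = [(0, 1), (1, 2)]
print(feeder_voltages(1.0, branches,
                      p_flow={(0, 1): 0.4, (1, 2): 0.2},
                      q_flow={(0, 1): 0.1, (1, 2): 0.05},
                      r={(0, 1): 0.02, (1, 2): 0.03},
                      x={(0, 1): 0.04, (1, 2): 0.05}))
```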
On the basis of the internal power flow calculation model, the risk constraints of the node voltage amplitudes and branch apparent powers of the power distribution network to be scheduled are constructed, specifically:
R^V_t = Σ_{i∈N} w_i·R^V_{i,t} ≤ ε_V
R^S_t = Σ_{ij∈L} w_{ij}·R^S_{ij,t} ≤ ε_S
where R^V_t is the node voltage amplitude risk of the power distribution network to be scheduled during period t, R^S_t is the branch apparent power risk of the power distribution network to be scheduled during period t, ε_V is the node voltage amplitude risk threshold of the power distribution network to be scheduled, ε_S is the branch apparent power risk threshold of the power distribution network to be scheduled, w_i is the weight of node i and w_{ij} is the weight of branch ij, satisfying Σ_{i∈N} w_i = 1 and Σ_{ij∈L} w_{ij} = 1; R^V_{i,t} is the voltage amplitude risk at node i during period t, R^S_{ij,t} is the apparent power risk of branch ij, and L is the set of all branches in the power distribution network.
The node voltage amplitude risk and the branch apparent power risk are defined as the integral of the product of a probability density function and a severity function:
R^V_{i,t} = ∫ PDF(V_{i,t})·Sev_V(V_{i,t}) dV_{i,t}
R^S_{ij,t} = ∫ PDF(S_{ij,t})·Sev_S(S_{ij,t}) dS_{ij,t}
where PDF(V_{i,t}) and PDF(S_{ij,t}) are the probability density functions of the node voltage amplitude V_{i,t} and the branch apparent power S_{ij,t}, respectively, which can be obtained by probabilistic power flow calculation, for example by the point estimate method combined with the Gram-Charlier expansion; Sev_V(V_{i,t}) and Sev_S(S_{ij,t}) are the severity functions of the node voltage amplitude V_{i,t} and the branch apparent power S_{ij,t}, which satisfy:
Sev_V(V_{i,t}) = V^{min} - V_{i,t}  if V_{i,t} < V^{min};  0  if V^{min} ≤ V_{i,t} ≤ V^{max};  V_{i,t} - V^{max}  if V_{i,t} > V^{max}
Sev_S(S_{ij,t}) = S^{min} - S_{ij,t}  if S_{ij,t} < S^{min};  0  if S^{min} ≤ S_{ij,t} ≤ S^{max};  S_{ij,t} - S^{max}  if S_{ij,t} > S^{max}
where V^{min} and V^{max} are the lower and upper limits of the node voltage amplitude, respectively, and S^{min} and S^{max} are the lower and upper limits of the branch apparent power, respectively.
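The risk integral above may be approximated numerically, for example from samples produced by a probabilistic power flow; the following sketch assumes Monte-Carlo voltage samples, a simple band-deviation severity and illustrative limits.

```python
import numpy as np

def voltage_risk(samples, v_lo=0.95, v_hi=1.05, bins=50):
    """Approximate the node-voltage risk as a discretized integral of
    PDF(V) * Sev(V), where the PDF is estimated by a histogram of sampled
    voltages and the severity is the amount by which the voltage leaves
    the band [v_lo, v_hi] (zero inside the band).
    """
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    sev = np.maximum(v_lo - centers, 0.0) + np.maximum(centers - v_hi, 0.0)
    return float(np.sum(hist * sev * np.diff(edges)))

rng = np.random.default_rng(0)
v_samples = rng.normal(1.0, 0.04, size=10000)   # assumed voltage distribution
print(voltage_risk(v_samples))
```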
Step S2, acquiring the state variables, action variables and reward function of the scheduling model, and constructing a Markov decision process for the scheduling model based on the state variables, action variables and reward function.
In this embodiment, the state variables, action variables and reward function of the scheduling model are preferably obtained as follows:
defining the state variable s_t of the scheduling model during period t and the action variable a_t during period t:
s_t = { P^{WT}_{i,t}, P^{PV}_{i,t}, P^L_{i,t}, α_{M,t}, SOC_{i,t} }
a_t = { P^G_{i,t}, P^{ES}_{i,t} }
where P^{WT}_{i,t} is the active power of wind generation at node i during period t, P^{PV}_{i,t} is the active power of photovoltaic generation at node i during period t, P^L_{i,t} is the load of node i during period t, α_{M,t} is the real-time electricity price, SOC_{i,t} is the state of charge of the energy storage system at node i during period t, P^G_{i,t} is the active power of the diesel unit at node i during period t, and P^{ES}_{i,t} is the charging/discharging power of the energy storage system at node i during period t.
Wind generation, photovoltaic generation, load and electricity price are exogenous state variables: they are determined by system uncertainty and are not affected by the action variables. The energy storage state of charge is an endogenous state variable, which is affected by the action variables. For exogenous states, the state transition is realized by reading the data of the next period from the data set; for endogenous states, the state transition is realized by calculating the state of charge of the next period. The definition of the action variables is based on the decision variables of the optimization model; however, the active power exchanged with the main grid can be obtained from the active power P^G_{i,t} of the diesel unit at each node and the charging/discharging power P^{ES}_{i,t} of the energy storage system at each node in combination with the power flow calculation, and is therefore not included among the action variables.
At the same time, the reward function of the scheduling model is defined:
r(s_t, a_t) = -( R^C_t + R^P_t )
R^C_t = ω_1·( C^{G,fuel}_t + C^{G,carbon}_t + C^{ES}_t + C^M_t )
R^P_t = ω_2·P^{SOC}_t + ω_3·P^V_t + ω_4·P^S_t
where r(s_t, a_t) is the reward obtained by the agent for taking action a_t in state s_t, R^C_t is the weighted total operating cost during period t (including the fuel cost and carbon emission cost of the diesel units, the charging/discharging cost of the energy storage systems and the cost of purchasing electricity from the main grid), R^P_t is the weighted total penalty during period t (including the penalty P^{SOC}_t for violating the state-of-charge constraint, the penalty P^V_t for the node voltage amplitude risk exceeding its threshold and the penalty P^S_t for the branch apparent power risk exceeding its threshold), and ω_1, ω_2, ω_3 and ω_4 are the weights of the reward components.
The agent learns through interaction with the environment: the agent perceives the current environment state s_t and performs action a_t, the environment transitions to the next state s_{t+1}, and the agent obtains the reward r(s_t, a_t).
Based on the state variables, action variables and reward function, a Markov decision process is constructed for the scheduling model, specifically according to the following equations:
M = ( S, A, P, r )
s_{t+1} ~ P( · | s_t, a_t )
where S is the state space, A is the action space, P is the state transition probability function and r is the reward function. The goal of the agent is to maximize the long-term accumulated reward obtained by interacting with the environment; the cost and penalty in the reward function are therefore defined as negative, so as to guide the agent to minimize the operating cost and satisfy the constraints.
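For illustration, a minimal environment skeleton implementing this Markov decision process (exogenous data read from a data set, endogenous SOC propagated, reward equal to the negative cost plus penalty) could look as follows; the cost and penalty terms and all numeric values are stand-ins, not the embodiment's actual model.

```python
import numpy as np

class DistributionNetworkEnv:
    """Minimal MDP skeleton for the dispatch problem: wind, PV, load and
    price are exogenous and read from a data array, the SOC is endogenous
    and propagated, and the reward is the negative weighted cost plus a
    penalty. All cost/penalty expressions below are placeholders.
    """

    def __init__(self, data, e_cap=2.0):
        self.data = data            # array of shape (T, 4): wind, pv, load, price
        self.e_cap = e_cap

    def reset(self):
        self.t, self.soc = 0, 0.5
        return self._obs()

    def _obs(self):
        wind, pv, load, price = self.data[self.t]
        return np.array([wind, pv, load, price, self.soc], dtype=np.float32)

    def step(self, action):
        p_g, p_es = action                                      # diesel and ESS set-points
        cost = 0.02 * p_g**2 + 1.5 * p_g + 0.05 * abs(p_es)     # stand-in cost terms
        self.soc -= p_es / self.e_cap                           # simplified SOC transition
        penalty = 10.0 * max(0.0, abs(self.soc - 0.5) - 0.4)    # stand-in SOC penalty
        self.t += 1
        done = self.t >= len(self.data) - 1
        return self._obs(), -(cost + penalty), done, {}

env = DistributionNetworkEnv(np.random.rand(24, 4))
s = env.reset()
s, r, done, _ = env.step(np.array([0.5, 0.1]))
print(s, r, done)
```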
Step S3, training the policy network corresponding to the Markov decision process through the SAC algorithm in combination with basic data.
Specifically, in this embodiment, the parameters of the SAC algorithm are updated through the ASAM algorithm and the PER algorithm, and the agent and the policy network corresponding to the Markov decision process are trained through the updated SAC algorithm; the parameters of the SAC algorithm comprise the soft Q network parameters, the temperature coefficient and the network parameters of the policy network.
The objective function of the SAC algorithm is to maximize:
J(π) = Σ_t E_{(s_t, a_t)~ρ_π} [ r(s_t, a_t) + α·H( π(·|s_t) ) ]
where π(a_t|s_t) is the probability that the agent takes action a_t in state s_t, ρ_π is the state-action trajectory generated by policy π, H(π(·|s_t)) = E_{a_t~π}[ -log π(a_t|s_t) ] is the entropy of policy π, α is the temperature coefficient, which reflects the relative importance of the policy entropy and the reward in the objective function of the SAC algorithm (as α → 0 the objective degenerates to the maximization of long-term accumulated reward in conventional reinforcement learning), and E[·] denotes the mathematical expectation. By adding the maximization of the policy entropy to the objective function, the SAC algorithm effectively encourages the agent to explore unknown regions of the state-action space and improves its learning speed.
The SAC algorithm is based on artificial neural networks: the soft Q function is parameterized as Q_θ(s_t, a_t) with soft Q network parameters θ, and the Gaussian policy is parameterized as π_φ(a_t|s_t) with policy network parameters φ.
The input of the soft Q network is a state and an action, and its output is the 1-dimensional Q value of the state-action pair; the input of the policy network is a state, and its output is the mean and standard deviation of the Gaussian action. To alleviate the over-estimation problem of the soft Q function, two soft Q networks with parameters θ_i (i = 1, 2) are established and trained independently at the same time, and the smaller of the two output Q values is used to update the parameters of the soft Q networks and the policy network. The records (s_t, a_t, s_{t+1}, r_t) of the agent's interaction with the environment are stored in an experience replay pool, and each time the network parameters are updated, a batch of samples is drawn from the experience replay pool to perform stochastic gradient descent.
The parameters of the soft Q networks are updated through the soft Bellman residual:
J_Q(θ_i) = E_{(s_t, a_t)~D} [ ½·( Q_{θ_i}(s_t, a_t) - y_t )² ]
where D is the experience replay pool. Each soft Q network has a corresponding target network, whose parameters θ̄_i are obtained by soft updates of the soft Q network parameters:
θ̄_i ← τ·θ_i + (1 - τ)·θ̄_i
where τ is the smoothing coefficient of the target network and is much smaller than 1. In the soft Bellman residual, the smaller of the Q values output by the two target networks is substituted into the target value:
y_t = r(s_t, a_t) + γ·( min_{j=1,2} Q_{θ̄_j}(s_{t+1}, a_{t+1}) - α·log π_φ(a_{t+1}|s_{t+1}) ),   a_{t+1} ~ π_φ(·|s_{t+1})
where γ is the discount factor.
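A short sketch of the soft Bellman target described above, taking the smaller of the two target-network Q values and adding the entropy term, is shown below; the tensor layout (one value per sampled transition) is an assumption.

```python
import torch

def soft_q_target(reward, next_q1, next_q2, next_logp, alpha, gamma=0.99, done=None):
    """Soft Bellman target y_t used to regress both soft Q networks:
    the element-wise minimum of the two target-network Q values plus the
    entropy term -alpha * log pi, discounted and added to the reward.
    """
    if done is None:
        done = torch.zeros_like(reward)
    next_q = torch.min(next_q1, next_q2) - alpha * next_logp
    return reward + gamma * (1.0 - done) * next_q

# tiny example with fabricated batch values
r = torch.tensor([1.0, -0.5])
q1, q2 = torch.tensor([2.0, 1.0]), torch.tensor([1.8, 1.2])
logp = torch.tensor([-1.0, -0.7])
print(soft_q_target(r, q1, q2, logp, alpha=0.2))
```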
in order to improve generalization capability, an Adaptive Sharpness Aware Minimization (ASAM) algorithm is introduced into parameter update of the soft Q network, and an objective function of the algorithm is as follows:
Figure BDA0003768467550000157
wherein e is i As a network parameter theta i (i =1, 2), p being a hyper-parameter defining this neighborhood,
Figure BDA0003768467550000158
for a normalized operator of a network parameter, for a fully connected network:
Figure BDA0003768467550000159
Figure BDA00037684675500001510
and λ is a weight attenuation coefficient of L2 regularization, which is the weight coefficient of the kth layer of the ith soft Q network.
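The inner maximization of the ASAM objective can be approximated by a single normalized gradient step, as sketched below; the element-wise magnitude normalization operator used here is an assumption for a fully connected network.

```python
import torch

def asam_perturbation(params, grads, rho=0.5, eps=1e-12):
    """One adaptive-sharpness step: the gradient is rescaled by the
    parameter magnitudes (T_theta) and the perturbation
    epsilon = rho * T_theta^2 * grad / ||T_theta * grad|| is returned.
    """
    t_grads = [p.abs() * g for p, g in zip(params, grads)]        # T_theta * grad
    norm = torch.sqrt(sum(tg.pow(2).sum() for tg in t_grads)) + eps
    return [rho * p.abs() * tg / norm for p, tg in zip(params, t_grads)]

w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
loss = w.pow(2).sum()
loss.backward()
eps_hat = asam_perturbation([w], [w.grad], rho=0.1)
print(eps_hat[0])    # perturbation added to w before the sharpness-aware update
```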
For the policy network, the update goal is to minimize the Kullback-Leibler divergence of the policy, which amounts to minimizing:
J_π(φ) = E_{s_t~D, a_t~π_φ} [ α·log π_φ(a_t|s_t) - Q_θ(s_t, a_t) ]
where Q_θ(s_t, a_t) is the smaller of the Q values output by the two soft Q networks. The temperature coefficient α measures the trade-off between the reward and the policy entropy in the objective function. The magnitude of the reward function has a direct effect on the temperature coefficient α, so the performance of the SAC algorithm is impaired unless the temperature coefficient is adjusted across different tasks or during the training of the same task. During training, the temperature coefficient is therefore adjusted automatically with the goal of minimizing:
J(α) = E_{a_t~π_φ} [ -α·log π_φ(a_t|s_t) - α·H̄ ]
where H̄ is the target policy entropy.
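A compact sketch of the policy loss and the automatic temperature adjustment described above follows; the batch values and the log-parameterization of α are assumptions.

```python
import torch

def policy_and_alpha_losses(logp, q_min, log_alpha, target_entropy):
    """Policy loss (alpha * log pi - Q, using the smaller of the two soft
    Q values) and the temperature loss for automatic alpha adjustment,
    both returned as batch means. target_entropy is the entropy target.
    """
    alpha = log_alpha.exp()
    policy_loss = (alpha.detach() * logp - q_min).mean()
    alpha_loss = -(log_alpha * (logp + target_entropy).detach()).mean()
    return policy_loss, alpha_loss

logp = torch.tensor([-1.2, -0.8])                 # log pi(a_t | s_t) of sampled actions
q_min = torch.tensor([1.5, 0.9])                  # min of the two soft Q values
log_alpha = torch.zeros(1, requires_grad=True)
print(policy_and_alpha_losses(logp, q_min, log_alpha, target_entropy=-2.0))
```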
The soft Q network parameters, policy network parameters and temperature coefficient are then updated based on stochastic gradient descent. Updating the soft Q network parameters requires solving a min-max optimization problem:
first, the inner max problem is approximated by a first-order Taylor expansion to obtain the optimal perturbation ε̂_i, and then θ_i is updated by gradient descent. The parameter update formulas are:
ε̂_i = ρ·T_{θ_i}²·∇_{θ_i} J_Q(θ_i) / || T_{θ_i}·∇_{θ_i} J_Q(θ_i) ||_2
θ_i ← θ_i - λ_Q·∇_{θ_i} [ J_Q(θ_i + ε̂_i) + λ·||θ_i||²_2 ]
φ ← φ - λ_π·∇_φ J_π(φ)
α ← α - λ_α·∇_α J(α)
where λ_Q and λ_π are the learning rates of the soft Q networks and the policy network, respectively, and λ_α is the step size for updating the temperature coefficient α.
Secondly, through the Prioritized Experience Replay (PER) algorithm, each sample is given a priority based on the absolute value of its temporal-difference (TD) error, and the sampling probabilities are differentiated:
P(k) = p(k)^{β_1} / Σ_j p(j)^{β_1}
where P(k) is the sampling probability of the k-th sample in the experience replay pool, p(k) is the priority of the k-th sample in the experience replay pool, and β_1 measures the degree of prioritization (β_1 = 0 corresponds to uniform sampling). In proportional prioritization, the priority p(k) is defined as:
p(k) = |δ(k)| + ε
where δ(k) is the TD error of the k-th sample in the experience replay pool, i.e. a sample with a larger absolute TD error is considered to have a higher learning value, and ε is a small positive number ensuring that a sample still has some probability of being drawn even if its TD error is 0.
For the i-th soft Q network, the TD error δ_i is closely related to its loss function:
δ_i(k) = r_k + γ·( min_{j=1,2} Q_{θ̄_j}(s_{k+1}, a_{k+1}) - α·log π_φ(a_{k+1}|s_{k+1}) ) - Q_{θ_i}(s_k, a_k)
and the TD error used to update the priority of the k-th sample in the experience replay pool is the average of δ_1(k) and δ_2(k) (i = 1, 2).
The prioritization in sampling introduces a bias into the soft Q function estimate, so when calculating the loss function the bias is removed by weighting the samples with importance sampling (IS):
w_k = ( N·P(k) )^{-β_2} / max_j w_j
J_Q(θ_i) = E_k [ w_k · ½·( Q_{θ_i}(s_k, a_k) - y_k )² ]
where w_k is the IS weight of the k-th sample in the experience replay pool, which is normalized by the maximum weight for stability, N is the size of the experience replay pool, and β_2 measures the compensation strength of the IS weights (the compensation is complete when β_2 = 1). β_2 starts from an initial value at the beginning of training and increases linearly to 1 by the end of training.
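The prioritized sampling probabilities and importance-sampling weights described above may be computed as in the following sketch; the values of β_1, β_2 and ε are illustrative.

```python
import numpy as np

def per_probabilities(td_errors, beta1=0.6, eps=1e-3):
    """Proportional prioritization: p(k) = |delta(k)| + eps and
    P(k) = p(k)^beta1 / sum_j p(j)^beta1."""
    p = np.abs(td_errors) + eps
    probs = p**beta1
    return probs / probs.sum()

def is_weights(probs, idx, beta2):
    """Importance-sampling weights w_k = (N * P(k))^(-beta2), normalized
    by the maximum weight for stability; beta2 is annealed towards 1."""
    n = len(probs)
    w = (n * probs[idx]) ** (-beta2)
    return w / w.max()

td = np.array([0.5, -1.2, 0.1, 2.0])
probs = per_probabilities(td)
batch = np.random.default_rng(0).choice(len(td), size=2, p=probs)
print(probs, batch, is_weights(probs, batch, beta2=0.4))
```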
The training of the policy network corresponding to the Markov decision process comprises the following steps (with reference to FIG. 3):
Step S31, randomly initializing the policy network parameters φ and the two soft Q network parameters θ_1, θ_2, and copying the soft Q network parameters to the corresponding target networks: θ̄_1 ← θ_1, θ̄_2 ← θ_2.
Step S32, in each period of each scheduling cycle, the agent perceives the environment state, reading the wind generation, photovoltaic generation, load and electricity price of the current period from the training data set (the basic data comprises the training data set and a historical data set) together with the state of charge of the energy storage systems in the current period; an action is sampled from the Gaussian distribution defined by the action mean and variance output by the policy network and executed, a_t ~ π_φ(a_t|s_t); the environment transitions to the next state s_{t+1}, the wind generation, photovoltaic generation, load and electricity price of the next period are read from the training data set, and the state of charge of the energy storage systems in the next period is calculated; the agent obtains the reward r(s_t, a_t), and the sample (s_t, a_t, s_{t+1}, r_t) is stored in the experience replay pool with the current maximum priority p = max_j p_j.
Step S33, in each period of each scheduling cycle, the k-th sample is drawn from the experience replay pool with probability P(k), its IS weight w_k and TD error δ(k) are calculated, its priority p(k) is updated, and the soft Q network loss function J_Q(θ_i) is accumulated with the IS weight w_k; this process draws n samples in total.
Step S34, in each period of each scheduling cycle, the optimal neighborhood perturbation ε̂_i of the network parameters defined by the adaptive sharpness is calculated, the soft Q network parameters θ_1, θ_2, the policy network parameters φ and the temperature coefficient α are updated based on gradient descent, and the target network parameters θ̄_1, θ̄_2 are soft-updated.
Step S35, repeating steps S32 to S34 until the current scheduling cycle ends.
Step S36, repeating steps S32 to S35 until the number of scheduling cycles reaches a preset value and the cycle reward curve tends to be stable.
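A high-level skeleton of the training procedure in steps S31 to S36 is sketched below; the environment, action-selection and update routines are stand-ins passed in as callables, and the prioritized replay bookkeeping is only indicated in the comments.

```python
import numpy as np

def train(env_step, env_reset, select_action, update_networks,
          episodes=3, horizon=24):
    """Skeleton of the training loop: for each scheduling cycle the agent
    senses the state, samples an action from the Gaussian policy, stores
    the transition (with maximal priority in the full method) and performs
    one prioritized SAC/ASAM/PER update per period.
    """
    replay = []
    for ep in range(episodes):
        s = env_reset()
        ep_reward = 0.0
        for t in range(horizon):
            a = select_action(s)
            s2, r, done = env_step(s, a)
            replay.append((s, a, r, s2))      # stored with max priority p = max_j p_j
            update_networks(replay)           # stands in for steps S33 and S34
            s, ep_reward = s2, ep_reward + r
            if done:
                break
        print(f"cycle {ep}: reward {ep_reward:.3f}")

# dummy stand-ins so the skeleton runs end-to-end
rng = np.random.default_rng(0)
train(env_step=lambda s, a: (s + 0.01 * a, -float(np.abs(a).sum()), False),
      env_reset=lambda: np.zeros(2),
      select_action=lambda s: rng.normal(size=2),
      update_networks=lambda buf: None)
```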
Step S4, scheduling the power distribution network to be scheduled based on the output of the trained policy network.
In this embodiment, with the trained policy network, at each period the agent perceives the current environment state s_t, reading the wind generation, photovoltaic generation, load and electricity price of the current period from the real-time data together with the state of charge of the energy storage systems in the current period, and executes the action a_t given by the action mean output by the policy network; the environment transitions to the next state s_{t+1}, the wind generation, photovoltaic generation, load and electricity price of the next period are read from the real-time data, the state of charge of the energy storage systems in the next period is calculated, and the agent obtains the reward r(s_t, a_t). The same steps are performed for each period until the end of the current scheduling cycle.
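At deployment time, the sketch below illustrates taking the Gaussian mean output by the trained policy network as the dispatch action for one period; the state layout and the action normalization are assumptions.

```python
import numpy as np

def dispatch_one_period(policy_mean, state):
    """Deterministic dispatch: the mean output by the trained policy
    network is taken as the action (diesel set-points and ESS
    charge/discharge powers). policy_mean stands in for the trained
    network's forward pass.
    """
    action = policy_mean(state)
    return np.clip(action, -1.0, 1.0)   # actions assumed normalized to [-1, 1]

state = np.array([0.3, 0.1, 0.6, 0.8, 0.5])   # wind, pv, load, price, soc
print(dispatch_one_period(lambda s: np.tanh(s[:2]), state))
```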
Correspondingly, referring to fig. 4, an embodiment of the present invention further provides a power distribution network scheduling apparatus based on deep reinforcement learning, including:
the constraint module 101 is configured to construct an operation constraint and a cost function respectively corresponding to a plurality of devices for a power distribution network to be scheduled, construct a constraint and a cost function corresponding to electric energy transaction between the power distribution network to be scheduled and a main network, construct a risk constraint of node voltage and branch power of the power distribution network to be scheduled, and obtain a scheduling model of the power distribution network to be scheduled;
a markov decision process constructing module 102, configured to obtain a state variable, an action variable, and a reward function of the scheduling model, and construct a markov decision process for the scheduling model based on the state variable, the action variable, and the reward function;
a training module 103, configured to train, in the markov decision process, a policy network corresponding to the markov decision process through a SAC algorithm in combination with basic data;
and the scheduling module 104 is configured to schedule the power distribution network to be scheduled based on the output of the trained policy network.
Correspondingly, the embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium comprises a stored computer program; when the computer program runs, the device where the computer readable storage medium is located is controlled to execute the power distribution network scheduling method based on the deep reinforcement learning.
If the modules integrated in the power distribution network scheduling device based on deep reinforcement learning are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments can be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
The embodiments of the invention provide a power distribution network scheduling method and device based on deep reinforcement learning and a computer-readable storage medium, the method comprising: constructing, for a power distribution network to be scheduled, operation constraints and cost functions for each of a plurality of devices, constraints and a cost function for electric energy trading between the power distribution network to be scheduled and the main grid, and risk constraints on the node voltages and branch powers of the power distribution network to be scheduled, to obtain a scheduling model of the power distribution network to be scheduled; acquiring the state variables, action variables and reward function of the scheduling model, and constructing a Markov decision process for the scheduling model based on them; training the policy network corresponding to the Markov decision process through the SAC algorithm in combination with basic data; and scheduling the power distribution network to be scheduled based on the output of the trained policy network. Compared with the prior art, by constructing the Markov decision process, the policy network trained through the SAC algorithm adapts to online operation and complex computation, achieves millisecond-level fast calculation, and significantly improves generalization capability.
It should be noted that the above-described apparatuses are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement without inventive effort.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modifications, equivalent replacements, improvements and the like made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A power distribution network scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
constructing operation constraints and cost functions respectively corresponding to a plurality of devices of a power distribution network to be scheduled, constructing constraints and a cost function corresponding to electric energy transactions between the power distribution network to be scheduled and a main network, and constructing node voltage and branch power risk constraints of the power distribution network to be scheduled, to obtain a scheduling model of the power distribution network to be scheduled;
acquiring a state variable, an action variable and a reward function of the scheduling model, and constructing a Markov decision process for the scheduling model based on the state variable, the action variable and the reward function;
training a strategy network corresponding to the Markov decision process through a SAC algorithm in combination with basic data in the Markov decision process;
and scheduling the power distribution network to be scheduled based on the output of the trained strategy network.
2. The power distribution network scheduling method based on deep reinforcement learning according to claim 1, wherein the training of the strategy network corresponding to the Markov decision process through the SAC algorithm specifically comprises:
updating parameters of the SAC algorithm through an ASAM algorithm and a PER algorithm, and training an intelligent agent and a strategy network corresponding to the Markov decision process through the updated SAC algorithm; wherein the parameters of the SAC algorithm comprise soft Q network parameters, temperature coefficients and network parameters of the policy network.
3. The power distribution network scheduling method based on deep reinforcement learning according to claim 1, wherein the plurality of devices comprise at least one diesel engine set and at least one energy storage system;
the operation constraint of the diesel engine set is:
P_i^{G,min} ≤ P_{i,t}^G ≤ P_i^{G,max},  ∀i ∈ N_G, ∀t ∈ T
wherein P_{i,t}^G is the active power output of the ith diesel engine set in the power distribution network to be scheduled in the period t, P_i^{G,min} is the minimum active power of the ith diesel engine set of the power distribution network to be scheduled, P_i^{G,max} is the maximum active power of the ith diesel engine set of the power distribution network to be scheduled, N_G is the set of all nodes to which a diesel engine set is connected, and T is the set of all time periods in the scheduling cycle;
the cost function of the diesel engine set is:
C_t^{G,fuel} = Σ_{i∈N_G} [ a_{G,i} (P_{i,t}^G)^2 + b_{G,i} P_{i,t}^G + c_{G,i} ]
together with a carbon emission cost C_t^{G,carbon}, summed over all diesel engine sets and determined by the active output P_{i,t}^G with the coefficients d_{G,i} and e_{G,i};
wherein C_t^{G,fuel} is the sum of the fuel costs of all the diesel engine sets in the power distribution network to be scheduled in the period t, C_t^{G,carbon} is the sum of the carbon emission costs of all the diesel engine sets in the power distribution network to be scheduled in the period t, a_{G,i}, b_{G,i} and c_{G,i} are the fuel cost coefficients of the ith diesel engine set, and d_{G,i} and e_{G,i} are the carbon emission cost coefficients of the ith diesel engine set.
4. The power distribution network scheduling method based on deep reinforcement learning of claim 3, wherein the operation constraints of the energy storage system are as follows:
-P_i^{E,ch} ≤ P_{i,t}^E ≤ P_i^{E,dis},  ∀i ∈ N_E, ∀t ∈ T
SOC_{i,t}^{min} ≤ SOC_{i,t} ≤ SOC_{i,t}^{max}
together with a state-of-charge transition equation in which SOC_{i,t+1} is determined by SOC_{i,t}, the charging and discharging power of the period, the charging efficiency η_C, the discharging efficiency η_D and the capacity E_i;
wherein P_{i,t}^E is the active output of the ith energy storage system in the power distribution network to be scheduled in the period t, P_i^{E,ch} is the maximum charging power of the ith energy storage system in the power distribution network to be scheduled, P_i^{E,dis} is the maximum discharging power of the ith energy storage system in the power distribution network to be scheduled, N_E is the set of all nodes to which an energy storage system is connected in the power distribution network to be scheduled, SOC_{i,t} is the state of charge of the ith energy storage system in the power distribution network to be scheduled in the period t, SOC_{i,t}^{min} is the minimum state of charge allowed by the ith energy storage system in the period t, SOC_{i,t}^{max} is the maximum state of charge allowed by the ith energy storage system in the period t, η_C is the charging efficiency of the energy storage system, η_D is the discharging efficiency of the energy storage system, and E_i is the capacity of the ith energy storage system in the power distribution network to be scheduled;
the cost function of the energy storage system is:
C_t^E = Σ_{i∈N_E} a_{E,i} |P_{i,t}^E|
wherein C_t^E is the sum of the charging and discharging costs of all the energy storage systems of the power distribution network to be scheduled in the period t, and a_{E,i} is the cost coefficient of the ith energy storage system in the power distribution network to be scheduled.
5. The distribution network scheduling method based on deep reinforcement learning of claim 4, wherein the cost function of the electric power transaction between the distribution network to be scheduled and the main network is as follows:
C_t^M = (1 + β) a_{M,t} P_t^M when P_t^M ≥ 0, and C_t^M = (1 - β) a_{M,t} P_t^M when P_t^M < 0
wherein C_t^M is the cost of the power distribution network to be scheduled purchasing electricity from the main network in the period t, P_t^M > 0 denotes the power purchased by the power distribution network to be scheduled from the main network in the period t, P_t^M < 0 denotes the power sold by the power distribution network to be scheduled to the main network in the period t, a_{M,t} is the real-time electricity price in the period t, and β is the proportion by which the purchase and sale prices differ from the real-time price of the main network;
the constraints of the electric energy transaction between the power distribution network to be scheduled and the main network are as follows:
S_t^M = sqrt( (P_t^M)^2 + (Q_t^M)^2 )
S^{M,min} ≤ S_t^M ≤ S^{M,max}
wherein Q_t^M is the reactive power flowing from the main network to the power distribution network to be scheduled in the period t, S_t^M is the apparent power flowing from the main network to the power distribution network to be scheduled in the period t, S^{M,min} is the minimum capacity of the transmission line, and S^{M,max} is the maximum capacity of the transmission line.
6. The power distribution network scheduling method based on deep reinforcement learning of claim 5, wherein the constructing of the risk constraints of the node voltage and the branch power of the power distribution network to be scheduled comprises:
constructing an internal power flow calculation model of the power distribution network to be scheduled:
P_{i,t} = P_{i,t}^{PV} + P_{i,t}^{WT} + P_{i,t}^{G} + P_{i,t}^{E} - P_{i,t}^{L}
Q_{i,t} = Q_{i,t}^{G} + Q_{i,t}^{WT} - Q_{i,t}^{L}
with the active and reactive power exchanged with the main network additionally included in the injections of the nodes in N_0; the net nodal injections are related to the branch power flows P_{ij,t} and Q_{ij,t} through the coefficients B_{ij} that characterize the power flow direction on branch ij, and the apparent power flowing on branch ij satisfies
S_{ij,t} = sqrt( (P_{ij,t})^2 + (Q_{ij,t})^2 )
wherein P_{i,t} is the net injected active power of node i in the period t, Q_{i,t} is the net injected reactive power of node i in the period t, P_{ij,t} is the active power flowing on branch ij in the period t, Q_{ij,t} is the reactive power flowing on branch ij in the period t, N is the set of all nodes in the power distribution network to be scheduled, B_{ij} characterizes the power flow direction on branch ij, S_{ij,t} is the apparent power flowing on branch ij in the period t, P_{i,t}^{PV} is the active power of the photovoltaic generation at node i in the period t, P_{i,t}^{WT} is the active power of the wind power generation at node i in the period t, P_{i,t}^{L} is the active power of the load at node i in the period t, Q_{i,t}^{G} is the reactive power generated by the diesel engine set at node i in the period t, Q_{i,t}^{WT} is the reactive power of the wind power generation at node i in the period t, Q_{i,t}^{L} is the reactive power of the load at node i in the period t, and N_0 is the set of nodes at which the power distribution network is connected to the main network;
constructing risk constraints of node voltage amplitude and branch apparent power of the power distribution network to be scheduled on the basis of the internal power flow calculation model, and specifically:
R_t^V = Σ_{i∈N} w_i R_{i,t}^V ≤ ε_V
R_t^S = Σ_{ij∈Ω} w_{ij} R_{ij,t}^S ≤ ε_S
wherein R_t^V is the node voltage amplitude risk of the power distribution network to be scheduled in the period t, R_t^S is the branch apparent power risk of the power distribution network to be scheduled in the period t, ε_V is the node voltage amplitude risk threshold of the power distribution network to be scheduled, ε_S is the branch apparent power risk threshold of the power distribution network to be scheduled, w_i is the weight of node i, w_{ij} is the weight of branch ij, R_{i,t}^V is the voltage amplitude risk of node i in the period t, R_{ij,t}^S is the apparent power risk of branch ij, and Ω is the set of all branches in the power distribution network.
7. The power distribution network scheduling method based on deep reinforcement learning according to claim 6, wherein the obtaining of the state variables, the action variables and the reward functions of the scheduling model specifically includes:
defining a state variable s_t of the scheduling model in the period t and an action variable a_t in the period t:
s_t = ( P_{i,t}^{WT}, P_{i,t}^{PV}, P_{i,t}^{L}, α_{M,t}, SOC_{i,t} )
a_t = ( P_{i,t}^{G}, P_{i,t}^{E} )
wherein P_{i,t}^{WT} is the active power of the wind power generation at node i in the period t, P_{i,t}^{PV} is the active power of the photovoltaic generation at node i in the period t, P_{i,t}^{L} is the load of node i in the period t, α_{M,t} is the real-time electricity price, SOC_{i,t} is the state of charge of the energy storage system at node i in the period t, P_{i,t}^{G} is the active power of the diesel engine set at node i in the period t, and P_{i,t}^{E} is the charging and discharging power of the energy storage system at node i in the period t;
and defining a reward function of the scheduling model:
r(s_t, a_t) = -( C_t^{Σ} + D_t^{Σ} )
wherein r(s_t, a_t) is the reward obtained by the agent for taking action a_t in state s_t, C_t^{Σ} is the weighted total operation cost in the period t, D_t^{Σ} is the weighted total penalty in the period t, and ω_1, ω_2, ω_3 and ω_4 are the weights of the reward components.
8. The distribution network dispatching method based on deep reinforcement learning of claim 7, wherein the Markov decision process is constructed for the dispatching model based on the state variables, the action variables and the reward function, and specifically comprises:
constructing the Markov decision process according to the following equation:
( S, A, P, r )
wherein S is the state space, A is the action space, P is the state transition probability function giving the probability that the scheduling model transfers from the state s_t to the next state s_{t+1} under the action a_t, and r is the reward function.
9. A power distribution network scheduling device based on deep reinforcement learning, characterized by comprising:
the system comprises a constraint module, a master network and a scheduling module, wherein the constraint module is used for constructing operation constraints and cost functions corresponding to a plurality of devices of a power distribution network to be scheduled, constructing constraints and cost functions corresponding to electric energy transactions of the power distribution network to be scheduled and a master network, constructing risk constraints of node voltage and branch power of the power distribution network to be scheduled, and obtaining a scheduling model of the power distribution network to be scheduled;
the Markov decision process building module is used for obtaining a state variable, an action variable and a reward function of the scheduling model and building a Markov decision process for the scheduling model based on the state variable, the action variable and the reward function;
the training module is used for training a strategy network corresponding to the Markov decision process through an SAC algorithm in combination with basic data in the Markov decision process;
and the scheduling module is used for scheduling the power distribution network to be scheduled based on the output of the trained strategy network.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program; the computer program controls, when executed, an apparatus on which the computer readable storage medium is located to execute the method for scheduling a power distribution network based on deep reinforcement learning according to any one of claims 1 to 8.
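Purely for readability, the cost and reward structure recited in claims 3 to 7 may be read as in the following illustrative sketch; the numerical coefficients, the linear carbon-cost term, the absolute-value storage cost and the single risk weight are assumptions made for this sketch only and are not the claimed formulas:

# Illustrative reading of claims 3-7; coefficient values and the exact carbon and
# storage cost forms are assumptions made for this sketch only.

def diesel_cost(p_g, a=0.01, b=2.0, c=5.0, d=0.5, e=0.1):
    fuel = a * p_g ** 2 + b * p_g + c      # quadratic fuel cost with coefficients a, b, c
    carbon = d * p_g + e                   # carbon emission cost, assumed linear here
    return fuel + carbon

def storage_cost(p_e, a_e=0.05):
    return a_e * abs(p_e)                  # charge/discharge cost with coefficient a_E

def grid_trade_cost(p_m, price, beta=0.1):
    # buying (p_m >= 0) is charged above the real-time price, selling below it
    factor = (1 + beta) if p_m >= 0 else (1 - beta)
    return factor * price * p_m

def reward(p_g, p_e, p_m, price, v_risk, s_risk, w=(1.0, 1.0, 1.0, 10.0)):
    cost = w[0] * diesel_cost(p_g) + w[1] * storage_cost(p_e) + w[2] * grid_trade_cost(p_m, price)
    penalty = w[3] * (v_risk + s_risk)     # node-voltage and branch-power risk penalty
    return -(cost + penalty)               # reward is the negated weighted cost plus penalty

# Example: 0.8 MW of diesel output, 0.2 MW discharged from storage,
# 0.5 MW bought from the main grid at a price of 60, with small residual risks.
print(reward(0.8, 0.2, 0.5, 60.0, v_risk=0.01, s_risk=0.02))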
CN202210893449.XA 2022-07-27 2022-07-27 Power distribution network scheduling method, device and medium based on deep reinforcement learning Pending CN115169957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210893449.XA CN115169957A (en) 2022-07-27 2022-07-27 Power distribution network scheduling method, device and medium based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210893449.XA CN115169957A (en) 2022-07-27 2022-07-27 Power distribution network scheduling method, device and medium based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115169957A true CN115169957A (en) 2022-10-11

Family

ID=83496657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210893449.XA Pending CN115169957A (en) 2022-07-27 2022-07-27 Power distribution network scheduling method, device and medium based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115169957A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116316755A (en) * 2023-03-07 2023-06-23 西南交通大学 Energy management method for electrified railway energy storage system based on reinforcement learning
CN116316755B (en) * 2023-03-07 2023-11-14 西南交通大学 Energy management method for electrified railway energy storage system based on reinforcement learning
CN116562464A (en) * 2023-07-03 2023-08-08 南京菁翎信息科技有限公司 Deep reinforcement learning-based low-carbon optimal scheduling method for power system
CN116562464B (en) * 2023-07-03 2023-09-19 南京菁翎信息科技有限公司 Deep reinforcement learning-based low-carbon optimal scheduling method for power system

Similar Documents

Publication Publication Date Title
CN111884213B (en) Power distribution network voltage adjusting method based on deep reinforcement learning algorithm
CN114725936B (en) Power distribution network optimization method based on multi-agent deep reinforcement learning
Li et al. A merged fuzzy neural network and its applications in battery state-of-charge estimation
CN113572157B (en) User real-time autonomous energy management optimization method based on near-end policy optimization
CN112186743A (en) Dynamic power system economic dispatching method based on deep reinforcement learning
CN109347149A (en) Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning
CN113326994A (en) Virtual power plant energy collaborative optimization method considering source load storage interaction
CN115169957A (en) Power distribution network scheduling method, device and medium based on deep reinforcement learning
CN110518580A (en) Active power distribution network operation optimization method considering micro-grid active optimization
CN114362187B (en) Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning
CN112507614A (en) Comprehensive optimization method for power grid in distributed power supply high-permeability area
CN117057553A (en) Deep reinforcement learning-based household energy demand response optimization method and system
CN116451880B (en) Distributed energy optimization scheduling method and device based on hybrid learning
CN114723230B (en) Micro-grid double-layer scheduling method and system for new energy power generation and energy storage
Kim et al. Optimize the operating range for improving the cycle life of battery energy storage systems under uncertainty by managing the depth of discharge
CN113972645A (en) Power distribution network optimization method based on multi-agent depth determination strategy gradient algorithm
Liu et al. Multi-state joint estimation of series battery pack based on multi-model fusion
CN116359742B (en) Energy storage battery state of charge on-line estimation method and system based on deep learning combination extended Kalman filtering
CN117277327A (en) Grid-connected micro-grid optimal energy management method based on intelligent agent
CN117039981A (en) Large-scale power grid optimal scheduling method, device and storage medium for new energy
CN115276067B (en) Distributed energy storage voltage adjusting method suitable for dynamic topological change of power distribution network
CN114048576B (en) Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid
CN117559429A (en) Light Chu Zhi flexible power distribution system decision model training method, system and storage medium
CN116979579A (en) Electric automobile energy-computing resource scheduling method based on safety constraint of micro-grid
CN115001002A (en) Optimal scheduling method and system for solving energy storage participation peak clipping and valley filling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination