CN112149347B - Power distribution network load transfer method based on deep reinforcement learning
- Publication number
- CN112149347B (application CN202010974175.8A)
- Authority
- CN
- China
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/04—Constraint-based CAD
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2113/00—Details relating to the application field
- G06F2113/04—Power grid distribution networks
Abstract
Description
Technical field
The present invention relates to the technical field of distribution network fault handling, and in particular to a distribution network load transfer method based on deep reinforcement learning.
Background art
With the rapid development of China's national economy, and in particular the growing electricity consumption of the tertiary industry, the share of small and medium-sized users and residential load has gradually increased and the structure of the power load has changed. The number of nodes in the distribution network has grown substantially, lines have become longer and the network structure more complex, so the probability of faults has risen accordingly. After a fault occurs in the distribution network, the faulted line can be cleared by adjusting the open/closed states of the network switches, isolating the fault and transferring the load in the affected area to another supply path, thereby reducing the scope of the outage and improving the overall economy and safety of grid operation.
At present, the load transfer methods proposed by scholars at home and abroad can basically be divided into the following categories: heuristic algorithms, mathematical optimization methods, expert system methods, and artificial intelligence algorithms. All of these algorithms can produce a feasible transfer plan, but each has certain shortcomings.
Heuristic algorithms, built on intuition or experience to imitate human reasoning, partition the de-energized area according to its location and the remaining capacity of the tie switches, and try to produce a solution in one pass using simple operations. The optimality of the solution is hard to guarantee, the method easily falls into a local optimum, and the quality of the solution depends strongly on the initial state of the network. Although this approach does not require many power flow calculations and offers comparatively good real-time performance among current algorithms, it still needs several power flow solutions to choose among candidate plans, so it cannot meet the real-time requirement of distribution network load transfer.
Mathematical optimization algorithms describe the network reconfiguration problem with a simplified mathematical model. The optimal flow pattern method, for example, closes each loop and then opens the switch carrying the smallest current. When the distribution network is large, complex and high-dimensional, the calculation must be repeated until it converges and a "combinatorial explosion" can occur; the simplifications made to the grid simulation introduce many uncertainties into the solution process and noticeably affect the accuracy of the final result. Because the mathematical model is relatively simple, it cannot handle a complex, large grid well; the calculation proceeds from local to global and easily falls into a local optimum, and it consumes a great deal of time, leading to excessively long outages. It therefore cannot meet the real-time requirement of distribution network load transfer.
The expert system method can automatically generate the switching plan needed to restore service and store it in a library. It has good real-time performance and wide applicability and can be used on large networks. However, building and integrating the knowledge base of an expert system is time-consuming and laborious, and in practice faults are so varied that the library cannot cover every situation.
Traditional artificial intelligence algorithms mainly comprise random search algorithms and supervised learning algorithms. Random search algorithms such as tabu search, particle swarm optimization and genetic algorithms require many iterations and heavy computation, take a long time to solve, may miss the optimum or fail to converge, and cannot balance solution speed against global optimality. Supervised learning algorithms such as neural network methods must learn from past experience; with sufficient samples they can find the global optimum, but good training results are hard to obtain when labelled data are scarce. Both families search for the optimal solution only after the fault has occurred and the fault information has been obtained, requiring a large number of iterative calculations and power flow solutions; if the initial solution is far from the optimum, a great deal of time is spent finding it, so they also cannot provide a good solution in a short time.
Summary of the invention
Embodiments of the present invention provide a distribution network load transfer method based on deep reinforcement learning to overcome the problems of the prior art.
In order to achieve the above object, the present invention adopts the following technical solution.
A distribution network load transfer method based on deep reinforcement learning, comprising:
Step 1. Initialize the main neural network Q(S, A, ω, α, β) and a target network T(S, A, ω*, α*, β*) with exactly the same structure as the main network Q; initialize the experience pool R, the discount factor γ, the learning rate L_r, the target network update frequency N_replace and the sampling batch size N_batch, and set the end-state flag Done = 0. The main neural network Q, the target network T and the experience pool constitute the agent of the distribution network;
Step 2. When a fault occurs in the distribution network, load transfer begins;
Step 3. Read the real-time state information of the distribution network and feed it to the agent; the agent computes an evaluation value for every action from this real-time state information;
Step 4. Based on the evaluation value of each action, the agent selects the corresponding action according to its action policy;
Step 5. The agent executes the action on the distribution network and obtains the post-action state S'. The action and the resulting state are evaluated: the reward Reward is computed from the constraints and the objective function, and the value of Done is determined from the reward and the termination rules. After one switching action of the distribution network is completed, it is stored in the experience pool R as an experience sample e = (s, a, r, s');
Step 6. Randomly sample N_batch experience samples from the experience pool R, compute the target value from the sampled experiences using the discount factor γ, and update the parameters ω, α, β of the main neural network Q(S, A, ω, α, β) by minimizing the loss function based on the target value and the learning rate L_r;
Step 7. After the main neural network has been updated N_replace times, use the parameters ω, α, β of the main network Q to update the parameters ω*, α*, β* of the target network T;
Step 8. Decide from the end flag Done whether the action sequence has finished: if Done = 0, return to step 4; if Done = 1, exit the loop, and the load transfer process of the distribution network for this fault is complete.
Preferably, step 1 further includes:
defining the system state space, action space and reward function of the distribution network load transfer operation. The interaction between the agent and the distribution network environment is represented by the tuple [S, A, P(s, s'), R(s, a), Done], where S is the state space formed by the possible states of the distribution network, A is the set of possible actions, P(s, s') is the probability of transitioning from distribution network state s to s', and R(s, a) is the reward triggered by taking action a in state s, which is fed back to the agent. Done is the end-state flag: when the agent actively chooses to terminate the decision episode, or the environment terminates it for violating a constraint, Done is set to 1; during normal decision steps Done remains 0;
the state space is defined as an array S = [V, I, SW, F]. V is the group of voltage vectors representing the voltages of all phases at every node of the distribution network, with V_{i,n} the voltage of the n-th phase at the i-th node; I is the group of current vectors representing the currents of all phases on every line, with I_{i,n} the current of the n-th phase of the i-th line; SW is the vector of state values of all switches in the distribution network, with SW_i the state of the i-th switch, 0 meaning open and 1 meaning closed; F is the vector of fault states of the distribution network lines, with F_i the fault state of line i, 0 meaning normal and 1 meaning faulted.
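As a concrete illustration of how this state array might be flattened into a single input vector for the neural network, a minimal Python sketch is given below; the function name, array shapes and example numbers are illustrative assumptions and not part of the patent.

```python
import numpy as np

def build_state(node_voltages, line_currents, switch_states, line_faults):
    """Flatten S = [V, I, SW, F] into one feature vector for the Q network.

    node_voltages : per-node, per-phase voltage values V_{i,n}
    line_currents : per-line, per-phase current values I_{i,n}
    switch_states : SW_i, 0 = open, 1 = closed
    line_faults   : F_i, 0 = normal, 1 = faulted
    """
    return np.concatenate([
        np.asarray(node_voltages, dtype=np.float32).ravel(),
        np.asarray(line_currents, dtype=np.float32).ravel(),
        np.asarray(switch_states, dtype=np.float32),
        np.asarray(line_faults, dtype=np.float32),
    ])

# Toy example: 2 nodes x 3 phases, 2 lines x 3 phases, 2 switches, 2 lines
s = build_state([[1.00, 0.99, 1.01], [0.98, 0.97, 0.98]],
                [[120.0, 118.0, 121.0], [0.0, 0.0, 0.0]],
                [1, 0],
                [0, 1])
```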
Preferably, step 1 further includes:
the agent adopts the Dueling-DQN algorithm, which performs its computation with deep neural networks; the deep neural networks comprise the main neural network Q and the target network T, and both Q and T consist of shared hidden layers, a value function V and an advantage function B;
the shared hidden layers of the value function V and the advantage function B are a two-layer neural network used to extract features from the input state quantities. The first layer has 30*N_feature neurons, where N_feature is the number of input state quantities; all neurons take the state data directly as fully connected input with a bias added, and the activation function is the ReLU function. The second layer is fully connected to the first and also has 30*N_feature neurons;
the agent uses the Dueling-DQN algorithm to process the outputs of the main neural network Q and the target network T and computes the evaluation value of each action.
Preferably, reading the real-time state information of the distribution network in step 3, feeding it to the agent, and having the agent compute the evaluation value of each action from that information includes:
the value function V in the main neural network Q and the target network T depends only on the state S and not on the action A; it is a scalar, written V(S, ω, α). The advantage function B depends on both the state S and the action A; it is a vector whose length equals the number of actions, written B(S, A, ω, β). The action-value function of the agent is expressed as:
Q(S, A, ω, α, β) = V(S, ω, α) + B(S, A, ω, β)
where ω are the network parameters of the shared part, α are the parameters exclusive to the value function, and β are the parameters exclusive to the advantage function; the final output of the Q network is a linear combination of the output of the value function network and the output of the advantage function network;
the advantage function part is centred, and the combination actually used is:

Q(S, A, ω, α, β) = V(S, ω, α) + B(S, A, ω, β) - (1/|𝒜|) * Σ_{a'∈𝒜} B(S, a', ω, β)

where 𝒜 denotes the set of all actions and |𝒜| is the number of elements in that set. The Q(S, A, ω, α, β) computed with this formula is a vector whose length equals the number of actions, and each of its elements is the evaluation value of the corresponding action in state S.
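A minimal sketch of this centred combination in Python/PyTorch follows; the tensor shapes and example values are assumptions chosen only for illustration.

```python
import torch

def dueling_q(v, b):
    """Q(S, A, ω, α, β) = V(S) + B(S, A) - mean over a' of B(S, a').

    v : value head output, shape (batch, 1)
    b : advantage head output, shape (batch, n_action)
    """
    return v + b - b.mean(dim=1, keepdim=True)

# One state, three candidate actions
v = torch.tensor([[1.5]])
b = torch.tensor([[0.2, -0.1, 0.5]])
q = dueling_q(v, b)   # per-action evaluation values for this state
```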
Preferably, the agent selecting the corresponding action according to the action policy based on the evaluation value of each action in step 4 includes:
the agent selects the corresponding action from the action evaluation vector according to the action policy. In non-exploration mode it selects the optimal action, i.e. the action with the highest evaluation value Q; in exploration mode it uses the ε-greedy strategy: draw a random number x, and if x < ε select the action with the highest evaluation value Q as the current action, while if x > ε select a random action from all actions, where ε is a preset parameter.
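The selection rule just stated can be sketched as follows; the function signature is an assumption made for illustration.

```python
import random

def select_action(q_values, epsilon, explore=True):
    """Pick an action index from the per-action evaluation vector.

    Non-exploration mode: always return the action with the highest Q.
    Exploration mode (ε-greedy as stated above): draw a random number x;
    if x < ε take the greedy action, otherwise take a uniformly random one.
    """
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if not explore:
        return greedy
    x = random.random()
    return greedy if x < epsilon else random.randrange(len(q_values))
```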
Preferably, the agent executing the action on the distribution network in step 5 includes:
the action A is a number, an integer in the range 0 to 2N_switch. When the action A equals 2N_switch, it means that no operation is taken and the episode is exited, and this decision ends; when the action A is in the range 0 to 2N_switch - 1, the following is computed from A:
x = A % 2
where x is the remainder of A divided by 2; the meaning of this expression is as follows:
each action either operates one switch or exits directly; if the agent exits, this decision ends.
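A sketch of this action decoding is given below. The text states only that x = A % 2 and that each action either operates one switch or exits; the use of A // 2 as the switch index and of x as the open/close command are assumptions made to complete the example.

```python
def decode_action(a, n_switch):
    """Map the integer action a in [0, 2*N_switch] to a switching operation."""
    if a == 2 * n_switch:
        return ("exit", None)                 # take no operation and end this decision
    switch_index = a // 2                     # assumed: quotient selects the switch
    x = a % 2                                 # remainder of a divided by 2
    command = "close" if x == 1 else "open"   # assumed meaning of x
    return (command, switch_index)
```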
Preferably, obtaining the post-action state S' of the distribution network in step 5, evaluating the action and the resulting state, computing the reward Reward from the constraints and the objective function, and determining the value of Done from the reward and the termination rules includes:
setting the constraints of the distribution network, including:
the voltage must stay within an allowable deviation of ±7%; for any voltage outside this range the voltage penalty is set to P_Volt = -10 and the end flag Done is set to 1, and for voltages within the range P_Volt = 0;
when the current through a line or transformer exceeds its limit, the current penalty is set to P_Lim = -10 and the end flag Done is set to 1; for currents that do not exceed their limits, P_Lim = 0;
a ring-network penalty P_Loop is set for the agent;
an invalid-action penalty P_Act is set for the agent;
setting the objective function of the distribution network, including:
a load loss evaluation value E_Loadloss is set according to the proportion of load that is lost, where L_loss is the amount of load lost to the outage and L_total is the total load of the whole power system; the resulting E_Loadloss lies between -2 and 2;
an evaluation value E_Num of the number of switching operations is set, where A_Num is the total number of switches whose state changes in this decision and L_Num is the total number of switches; the resulting E_Num lies between -1 and 1;
an evaluation value E_Loss of the line losses of the distribution network is set, where Line is the total number of lines that are not out of service, I_i is the actual current of the i-th line, R_i is the resistance of the i-th line and its transformer, and S is the total power of the whole network;
for nodes whose voltage does not exceed the ±7% range, an evaluation value E_Vot of the voltage deviation of the lines is set, where N is the total number of nodes that are not out of service and pu_i is the per-unit voltage of node i;
the reward function given by the environment is the sum of the above evaluation values, namely:

Reward = P_Volt + P_Lim + P_Loop + P_Act + E_Loadloss + E_Num + E_Loss + E_Vot.
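The following sketch aggregates the penalty and evaluation terms into Reward. The exact formulas for E_Loadloss, E_Num, E_Loss and E_Vot are not reproduced in this copy of the text, so the expressions used here are assumptions chosen only to match the stated value ranges and the amplification factors (-10 and 20) mentioned in the detailed description.

```python
def compute_reward(p_volt, p_lim, p_loop, p_act,
                   l_loss, l_total, a_num, l_num,
                   line_currents, line_resistances, s_total, node_pu):
    """Reward = P_Volt + P_Lim + P_Loop + P_Act + E_Loadloss + E_Num + E_Loss + E_Vot."""
    e_loadloss = 2.0 - 4.0 * l_loss / l_total              # assumed form, range [-2, 2]
    e_num = 1.0 - 2.0 * a_num / l_num                      # assumed form, range [-1, 1]
    loss_rate = sum(i * i * r for i, r in zip(line_currents, line_resistances)) / s_total
    e_loss = -10.0 * loss_rate                             # loss rate amplified by -10
    e_vot = -20.0 * sum(abs(1.0 - pu) for pu in node_pu) / len(node_pu)  # deviation x 20, sign assumed
    return p_volt + p_lim + p_loop + p_act + e_loadloss + e_num + e_loss + e_vot
```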
Preferably, randomly sampling N_batch experience samples from the experience pool R in step 6, computing the target value from the sampled experiences using the discount factor γ, and updating the parameters ω, α, β of the main neural network Q(S, A, ω, α, β) by minimizing the loss function based on the target value and the learning rate L_r includes:
randomly sampling N_batch experience samples e_i = (s_i, a_i, r_i, s'_i) from the experience pool R, with N_batch = 20; computing the target value and updating the parameters ω, α, β of the main neural network Q(S, A, ω, α, β) by minimizing the loss function, with the RMSProp algorithm determining the size of the parameter update and a learning rate L_r of 0.1; one update of the main neural network Q represents one learning step of the agent.
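A sketch of one such learning step in Python/PyTorch is shown below. The target y_i = r_i + γ·max over a' of T(s'_i, a') and the mean-squared-error loss are the standard Dueling-DQN forms; this copy of the text names the quantities but does not reproduce the formulas, so they should be read as an assumption.

```python
import random
import torch

def learn(q_net, t_net, replay_pool, optimizer, gamma, n_batch=20):
    """Sample N_batch experiences, build targets with T, and update Q with RMSProp."""
    batch = random.sample(replay_pool, n_batch)
    s = torch.stack([torch.as_tensor(e[0], dtype=torch.float32) for e in batch])
    a = torch.tensor([e[1] for e in batch], dtype=torch.int64)
    r = torch.tensor([e[2] for e in batch], dtype=torch.float32)
    s_next = torch.stack([torch.as_tensor(e[3], dtype=torch.float32) for e in batch])

    with torch.no_grad():
        y = r + gamma * t_net(s_next).max(dim=1).values    # target value (assumed standard form)
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s_i, a_i, ω, α, β)
    loss = torch.nn.functional.mse_loss(q, y)              # loss to be minimised

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # RMSProp step updates ω, α, β
    return loss.item()

# optimizer = torch.optim.RMSprop(q_net.parameters(), lr=0.1)  # L_r = 0.1 as stated above
```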
It can be seen from the technical solution provided by the above embodiments of the present invention that the method of the present application uses deep reinforcement learning to improve the fault emergency recovery capability and reliability of the distribution network. The load transfer algorithm based on deep reinforcement learning avoids the large amount of computation and the iterative grid simulation otherwise required at fault time, increases the speed of load transfer and gives the distribution network higher reliability. By training and learning from experience, the reinforcement learning algorithm does not need to spend a large amount of time on simulation and analysis when a fault occurs; it makes the load transfer decision directly by analysing real-time operating big data and can therefore produce a better transfer strategy more quickly.
Additional aspects and advantages of the invention are set forth in part in the description that follows; they will become apparent from that description or may be learned by practice of the invention.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of the mapping from the load transfer decision to reinforcement learning provided by an embodiment of the present application;
Figure 2 is a structural diagram of a neural network provided by an embodiment of the present application;
Figure 3 is a processing flow chart of a distribution network load transfer method based on deep reinforcement learning provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
A person skilled in the art will understand that, unless specifically stated otherwise, the singular forms "a", "an", "the" and "said" used herein may also include the plural. It should further be understood that the word "comprising" used in the description of the present invention means that the stated features, integers, steps, operations, elements and/or components are present, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The expression "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
A person skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in common dictionaries are to be understood as having meanings consistent with their meanings in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.
To facilitate understanding of the embodiments of the present invention, several specific embodiments are further explained below with reference to the accompanying drawings, and none of the embodiments limits the embodiments of the present invention.
Because the construction of the distribution network lags to a certain extent and the capacity margins of the power equipment are small, load transfer in the distribution network is difficult, and a transfer plan must be proposed promptly for all kinds of sudden outage faults. This places higher demands on the computing speed and applicability of the algorithm, while existing algorithms all have certain limitations. Most existing algorithms perform ad-hoc simulation and analysis after the fault has occurred and rarely use the big data of real-time distribution network operation, which takes a long time; or they simplify the simulation process to speed up the calculation, which makes it difficult to balance the safety and the economy of distribution network operation.
The embodiments of the present invention adopt a reinforcement learning algorithm; through training and learning from experience, when a fault occurs there is no need to spend a large amount of time on simulation and analysis, and the load transfer decision is made directly by analysing real-time operating big data, so a better transfer strategy can be produced more quickly.
Taking the real-time state information of the distribution network as input, the agent uses the deep reinforcement learning Dueling-DQN algorithm to make decisions and select actions; after an action it moves to a new state, the action is evaluated with the constraints and the objective function, and the agent is rewarded or penalized. When the transfer is completed through a series of operations, the operation stops and the final operating strategy is obtained.
Figure 1 is a schematic diagram of the mapping from the load transfer decision to reinforcement learning provided by an embodiment of the present application; the interaction between the distribution network environment and the agent is described in detail below with reference to Figure 1.
First, the environment of reinforcement learning must be defined, i.e. the system state space, action space and reward function of the distribution network load transfer operation. The interaction between the agent and the distribution network environment is represented by the tuple [S, A, P(s, s'), R(s, a), Done], where S is the state space formed by the possible states of the distribution network, A is the set of possible actions, P(s, s') is the probability of transitioning from distribution network state s to s', and R(s, a) is the reward triggered by taking action a in state s, which is fed back to the agent. Done is the end-state flag: when the agent actively chooses to terminate the decision episode, or the environment terminates it for violating a constraint, Done is set to 1; during normal decision steps Done remains 0.
A. State space
The state space is defined as an array S = [V, I, SW, F]. V is the group of voltage vectors representing the voltages of all phases at every node of the distribution network, with V_{i,n} the voltage of the n-th phase at the i-th node; I is the group of current vectors representing the currents of all phases on every line, with I_{i,n} the current of the n-th phase of the i-th line; SW is the vector of state values of all switches in the distribution network, with SW_i the state of the i-th switch, 0 meaning open and 1 meaning closed; F is the vector of fault states of the distribution network lines, with F_i the fault state of line i, 0 meaning normal and 1 meaning faulted.
B. Action space
Facing a distribution network that changes in real time, the reinforcement learning agent must operate the corresponding switches in the network to control its state. The agent decides how to perform the next action from the current state of the distribution network and the reward function. The action A is a number, an integer in the range 0 to 2N_switch. When the action A equals 2N_switch, it means that no operation is taken and the episode is exited, and this decision ends; when the action A is in the range 0 to 2N_switch - 1, the following is computed from A:
x = A % 2
where x is the remainder of A divided by 2; the meaning of this expression is as follows:
each action either operates one switch or exits directly; if the agent exits, this decision ends.
C. Reward function
After the agent has applied the selected action to the distribution network environment, it receives the environment's evaluation of that action, and the present invention uses this evaluation as the agent's reward. The reward consists mainly of a constraint part and an objective function part, so that the operation achieves the most economical operating cost while guaranteeing normal operation of the distribution network.
(1) Constraints:
The operation and control of the distribution network must first consider its safe operation and the safety of the users' electricity supply. After the transfer, the voltage and current quality at every node of the distribution lines must satisfy the requirements; the voltage must stay within an allowable deviation of ±7%. For voltages outside this range a high penalty P_Volt is imposed and the end flag Done is set to 1.
When the transmitted power exceeds the limit of a line or transformer, the power equipment cannot be guaranteed to operate normally and secondary faults are easily triggered. The present invention therefore compares the current through each line and transformer with its limit; if the current exceeds the limit, the transmission capacity of the equipment is considered violated, a high penalty P_Lim is imposed, and the end flag Done is set to 1.
When a ring appears in the distribution network after the agent's operation, it may exist briefly as an intermediate transition state but is not allowed as a long-term operating state; the ring-network penalty P_Loop must therefore take the action state into account.
When the agent takes an invalid operation, such as closing a switch that is already closed, opening a switch that is already open, or operating a line that has been opened because of a fault, the action is considered invalid and the invalid-action penalty P_Act is applied.
(2) Objective function:
Provided the action satisfies the constraints, normal supply to the de-energized downstream area should be restored as far as possible; the load loss evaluation value E_Loadloss is therefore set according to the proportion of load that is lost.
Here L_loss is the amount of load lost to the outage and L_total is the total load of the whole power system; the resulting E_Loadloss lies between -2 and 2.
Every switching operation affects the life of the switch, and some switches may have to be operated manually. When the number of operations is too large, not only does the probability of operating errors increase, but the restoration time for the users' supply may fail to meet the requirements, and the structure of the medium-voltage distribution network changes so much that restoring the original operating mode after the fault is cleared or maintenance is finished becomes more difficult. Frequent switching should therefore be minimized to reduce the operating cost it causes; E_Num is the evaluation value of the number of operations.
Here A_Num is the total number of switches whose state changes in this decision and L_Num is the total number of switches; the resulting E_Num lies between -1 and 1.
Considering the economic operation of the distribution network, the line losses must be evaluated after the action is completed, using the impedance model of the energized lines; E_Loss is the line loss evaluation value.
Here Line is the total number of lines that are not out of service, I_i is the actual current of the i-th line, R_i is the resistance of the i-th line and its transformer, and S is the total power of the whole network. The right-hand side of the formula is the computed approximate line loss rate; because the line loss rate of distribution networks is usually between 5% and 12%, the rate is amplified by a factor of -10 so that E_Loss stays approximately between -1 and 0.
For nodes whose voltage does not exceed the ±7% range, the evaluation value E_Vot measures the degree of voltage deviation, so that the distribution network after the transfer has good voltage quality.
Here N is the total number of nodes that are not out of service and pu_i is the per-unit voltage of node i. Because the result of the right-hand formula is smaller than 0.07 and most voltage deviations do not exceed 0.05, it is amplified by a factor of 20 so that E_Vot stays approximately between -1 and 0.
The reward function given by the environment is the sum of the above evaluation values, namely:

Reward = P_Volt + P_Lim + P_Loop + P_Act + E_Loadloss + E_Num + E_Loss + E_Vot
D. Termination conditions
If an action causes a voltage violation or a violation of the transmission capacity of the equipment, the action round is forcibly ended and the action is considered to have failed, with the end flag Done = 1. If, after the action, the distribution network has restored the load of all non-faulted areas and there is no voltage or transmission capacity violation, the environment judges that the transfer is complete, the current round ends automatically and Done = 1. In special situations, however, for example when certain tie lines have insufficient capacity and part of the healthy but de-energized load must be shed to guarantee supply quality, or when multiple faults make transfer impossible, the environment cannot judge from the restoration of all non-faulted loads whether the transfer is complete; when the agent considers that no better action exists in the current state it may choose to end the current round and exit, with Done = 1. In all other cases Done = 0, so that the agent continues to act.
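The end-condition logic described above can be summarised in the following sketch; the boolean inputs are illustrative names, not identifiers from the patent.

```python
def end_flag(voltage_violation, capacity_violation, all_load_restored, agent_quits):
    """Return Done according to the rules described above."""
    if voltage_violation or capacity_violation:
        return 1   # round forcibly ended, action considered failed
    if all_load_restored or agent_quits:
        return 1   # transfer judged complete, or the agent chose to stop
    return 0       # otherwise keep acting
```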
The processing flow of the distribution network load transfer method based on deep reinforcement learning provided by an embodiment of the present invention is shown in Figure 3 and includes the following processing steps.
Step 1. Initialize the parameters ω, α, β of the main neural network Q and the parameters ω*, α*, β* of the target network T; initialize the experience pool R, the discount factor γ, the learning rate L_r, the target network update frequency N_replace and the sampling batch size N_batch; set Done = 0.
In the initialization phase, in addition to the main neural network Q(S, A, ω, α, β), another target network T(S, A, ω*, α*, β*) with exactly the same structure as Q is needed; its main role is to compute the error from which the main neural network learns.
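A minimal initialization sketch for this step in Python/PyTorch is given below; the placeholder network, the feature and action counts, the pool capacity and the value of γ are assumptions, while L_r, N_replace and N_batch take the values mentioned in this description.

```python
import copy
from collections import deque
import torch
import torch.nn as nn

n_feature, n_action = 40, 21            # illustrative sizes, not taken from the patent

# Placeholder for Q(S, A, ω, α, β); the actual two-head dueling structure is sketched below.
q_net = nn.Sequential(nn.Linear(n_feature, 30 * n_feature), nn.ReLU(),
                      nn.Linear(30 * n_feature, n_action))
t_net = copy.deepcopy(q_net)            # target network T with an identical structure

replay_pool = deque(maxlen=10_000)      # experience pool R (capacity assumed)
gamma = 0.9                             # discount factor γ (value assumed)
lr = 0.001                              # learning rate L_r (typical value given in step 6)
n_replace, n_batch = 200, 20            # N_replace and N_batch as given below
done = 0                                # end-state flag Done
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=lr)
```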
Step 2. A fault occurs in the distribution network and load transfer begins.
Step 3. Read real-time state information such as the per-unit node voltages, line currents, switch open/closed states and switch fault states of the distribution network; after processing, obtain the state vector S and feed it to the main neural network Q, and the agent computes the action evaluation vector with the Dueling-DQN algorithm.
The agent of the distribution network load transfer method based on deep reinforcement learning in the embodiments of the present invention may use Deep Q Network or its evolved algorithms Double DQN and Dueling DQN; after comparison and testing, the Dueling DQN algorithm performs best in the load transfer decision process, so the present invention describes the reinforcement learning agent model using the Dueling DQN algorithm.
The Dueling DQN algorithm uses a deep neural network to obtain the Q values of all actions in Q-learning; its deep neural network part is able to evaluate actions and to train and learn, and its structure is shown in Figure 2.
In the deep neural network part of the Dueling DQN algorithm of the present invention, the shared hidden layers of the value function V and the advantage function A are a two-layer neural network used to extract features from the input state quantities. The first layer has 30*N_feature neurons, where N_feature is the number of input state quantities; all neurons take the state data directly as fully connected input with a bias added, and the activation function is the ReLU function. The second layer is fully connected to the first and, like the first, has 30*N_feature neurons.
The value function network and the advantage function network each have two layers. Their first layers are fully connected to the output of the shared hidden layers, have 30*N_feature neurons with a bias, and use the ReLU activation function. The second layer of the value function V has one neuron, is fully connected to the first layer, has a bias but no activation function, and outputs its result directly. The second layer of the advantage function A is fully connected to its first layer, has N_action neurons, and likewise outputs its result directly; finally, the outputs of the two networks are combined with the formulas given below to obtain the final Q value.
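The layer sizes just described can be put together in the following PyTorch sketch of the network of Figure 2; initialization details and data types are left at library defaults and the class name is an assumption.

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Two shared ReLU layers of 30*N_feature neurons, a one-neuron value head V,
    an N_action-neuron advantage head B, combined with the centred dueling formula."""

    def __init__(self, n_feature, n_action):
        super().__init__()
        hidden = 30 * n_feature
        self.shared = nn.Sequential(
            nn.Linear(n_feature, hidden), nn.ReLU(),   # first common layer (bias + ReLU)
            nn.Linear(hidden, hidden), nn.ReLU(),      # second common layer
        )
        self.value_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),      # first value layer
            nn.Linear(hidden, 1),                      # V(S): one neuron, no activation
        )
        self.adv_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),      # first advantage layer
            nn.Linear(hidden, n_action),               # B(S, A): one output per action
        )

    def forward(self, s):                              # s is expected to have a batch dimension
        h = self.shared(s)
        v = self.value_head(h)
        b = self.adv_head(h)
        return v + b - b.mean(dim=1, keepdim=True)     # centred dueling combination
```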
Dueling DQN improves on the DQN algorithm by splitting the Q network into two parts. The first part depends only on the state S and not on the specific action A to be taken; this part is called the value function part and is a scalar, written V(S, ω, α). The second part depends on both the state S and the action A; this part is called the advantage function part and is a vector whose length equals the number of actions, written B(S, A, ω, β). The evaluation value of each action is then computed as:
Q(S, A, ω, α, β) = V(S, ω, α) + B(S, A, ω, β)
where ω are the network parameters of the shared part, α are the parameters exclusive to the value function, and β are the parameters exclusive to the advantage function. The final output of the Q network is a linear combination of the output of the value function network and the output of the advantage function network and can directly evaluate the value of the action, but this expression cannot distinguish the respective contributions of V(S, ω, α) and B(S, A, ω, β) to the final output. To obtain this identifiability, the advantage function part is centred, and the combination actually used is:

Q(S, A, ω, α, β) = V(S, ω, α) + B(S, A, ω, β) - (1/|𝒜|) * Σ_{a'∈𝒜} B(S, a', ω, β)

where 𝒜 denotes the set of all actions and |𝒜| is the number of elements in that set; the right-hand side subtracts the mean of the elements of the advantage vector from the original vector, giving the new advantage term. The Q(S, A, ω, α, β) computed with this formula is a vector whose length equals the number of actions, and each of its elements is the evaluation value of the corresponding action in state S.
The target network T, which has the same structure as the main network Q, is used to overcome the oscillation of the training process caused by the random fluctuation of the samples: two deep neural networks T and Q with the same structure but different parameters are used, the Q network always holds the latest parameters and is updated at every learning step, while the T network is updated only once every N_replace actions.
Step 4. Based on the action evaluation vector, the agent selects the corresponding action according to the action policy: in non-exploration mode it selects the optimal action, i.e. the action with the highest evaluation value Q, and in exploration mode it selects the optimal action or a random action a according to ε-greedy.
During training, so that the agent can escape local optima and explore globally, the ε-greedy strategy is used: draw a random number x; if x < ε, select the action with the highest evaluation value Q as the current action, and if x > ε, select a random action from all actions. ε increases with the number of training episodes; when enough training has been done the parameters of the deep neural network hardly change any more, ε reaches 1, and the best action is selected every time.
Step 5. The environment executes the action and the post-action state S' is obtained. The action and the resulting state are evaluated, the reward Reward is computed from the constraints and the objective function, and the value of Done is determined by the termination rules. After one switching action of the distribution network is completed, it is stored in the experience pool R as an experience sample e = (s, a, r, s').
Step 6. Randomly sample N_batch experience samples e_i = (s_i, a_i, r_i, s'_i) from the experience pool R, usually with N_batch = 20; compute the target value and update the parameters ω, α, β of the main neural network Q(S, A, ω, α, β) by minimizing the loss function, with the RMSProp algorithm determining the size of the parameter update. The learning rate L_r of this algorithm determines the degree of the parameter update, i.e. the learning speed of the neural network, and its value is usually 0.001; one update of the main neural network Q represents one learning step of the agent.
Step 7. Every time the main neural network has been updated Nreplace times (typically Nreplace = 200), the parameters ω, α, β of the main network Q are copied to the parameters ω*, α*, β* of the target network T:
ω*, α*, β* ← ω, α, β
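In code, this periodic hard copy can be sketched as below, again assuming the PyTorch networks from the earlier sketches:

```python
def maybe_sync_target(q_net, target_net, update_count: int, n_replace: int = 200):
    """Copy ω, α, β from the main network Q into the target network T
    once every N_replace updates of Q."""
    if update_count % n_replace == 0:
        target_net.load_state_dict(q_net.state_dict())
```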
Step 8. Whether the action sequence should end is decided from the end flag Done. If Done = 0, return to Step 4; if Done = 1, exit the loop, the current load transfer decision is complete, and the method proceeds to the next step.
The above is a single-step action, whereas a complete load transfer typically consists of multiple switching actions performed in sequence. Whether the action sequence ends is therefore judged from the end flag Done. If Done = 0, the distribution network still needs further actions to complete the transfer: the real-time operating information of the distribution network is read again, the new state quantities are fed into the Q network for recalculation, and the next action decision process begins. If Done = 1, the current decision process stops and the method moves on to Step 9.
Step 9. Wait until the next distribution network fault occurs, then start a new load transfer decision process and go to Step 2.
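Putting Steps 4 to 9 together, the online decision loop for a single fault event might be sketched as follows. Here env, read_state, and the return signature of env.step are hypothetical stand-ins for the distribution-network interface described earlier, and select_action is the helper from the ε-greedy sketch above.

```python
import torch

def run_transfer_episode(env, q_net, read_state, max_steps: int = 50):
    """Run sequential switching decisions until the transfer is complete (Done = 1)."""
    state = read_state(env)                        # real-time operating data of the network
    for _ in range(max_steps):
        with torch.no_grad():
            q_values = q_net(torch.as_tensor([state], dtype=torch.float32))[0].numpy()
        action = select_action(q_values, epsilon=1.0, explore=False)  # deployed mode: greedy
        state, reward, done = env.step(action)     # execute one switching action
        if done:                                   # Done = 1: load transfer finished
            break
    return state
```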
In summary, this application provides a distribution network load transfer method based on deep reinforcement learning. It uses real-time operating data of the distribution network to make load transfer decisions and applies deep reinforcement learning to improve the emergency fault recovery capability and reliability of the distribution network. While guaranteeing safe and stable operation of the distribution network and safe electricity supply to users, it achieves the best possible trade-off between voltage quality and the economic efficiency of distribution network operation and switching. At the same time, the deep-reinforcement-learning-based load transfer algorithm avoids the large amount of computation and power grid simulation iterations otherwise required at fault time, increases the speed of load transfer, shortens the outage time of non-faulted areas, and thereby gives the distribution network higher reliability.
The reinforcement learning algorithm used in the present invention is the Dueling-DQN algorithm. Compared with commonly used reinforcement learning algorithms such as Q-learning and DQN, it identifies the state features of the distribution network more precisely and therefore produces more accurate load transfer decision schemes.
In the embodiments of the present invention, a reinforcement learning artificial intelligence algorithm is used to obtain and analyze real-time information about the operating distribution network for load transfer decision-making, so that the best control strategy can be given within a short time. With a reinforcement learning algorithm trained through experience, there is no need to spend a large amount of time on simulation calculation and analysis when a fault occurs; load transfer decisions are made directly by analyzing real-time operating big data, so better transfer strategies can be provided more quickly.
Those of ordinary skill in the art will understand that the accompanying drawing is only a schematic diagram of one embodiment, and that the modules or processes in the drawing are not necessarily required to implement the present invention.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of the present invention.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device and system embodiments are described relatively briefly because they are essentially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement this without creative effort.
The above are only preferred specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person familiar with the technical field could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the protection scope of the claims.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010974175.8A CN112149347B (en) | 2020-09-16 | 2020-09-16 | Power distribution network load transfer method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010974175.8A CN112149347B (en) | 2020-09-16 | 2020-09-16 | Power distribution network load transfer method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112149347A CN112149347A (en) | 2020-12-29 |
CN112149347B true CN112149347B (en) | 2023-12-26 |
Family
ID=73893097
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010974175.8A Active CN112149347B (en) | 2020-09-16 | 2020-09-16 | Power distribution network load transfer method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112149347B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818588B (en) * | 2021-01-08 | 2023-05-02 | 南方电网科学研究院有限责任公司 | Optimal power flow calculation method, device and storage medium of power system |
CN112766497B (en) * | 2021-01-29 | 2024-08-13 | 北京字节跳动网络技术有限公司 | Training method, device, medium and equipment for deep reinforcement learning model |
CN113206507B (en) * | 2021-05-13 | 2022-05-27 | 乐清长三角电气工程师创新中心 | Three-phase load unbalance edge side treatment method and system |
CN113627733B (en) * | 2021-07-16 | 2024-08-06 | 深圳供电局有限公司 | Post-disaster power distribution network dynamic rush-repair method and system |
CN113537646B (en) * | 2021-09-14 | 2021-12-14 | 中国电力科学研究院有限公司 | Power grid equipment power failure maintenance scheme making method, system, equipment and storage medium |
CN113837654B (en) * | 2021-10-14 | 2024-04-12 | 北京邮电大学 | Multi-objective-oriented smart grid hierarchical scheduling method |
CN114219045B (en) * | 2021-12-30 | 2024-10-18 | 国网北京市电力公司 | Distribution network risk dynamic early warning method, system, device and storage medium |
CN115577647B (en) * | 2022-12-09 | 2023-04-07 | 南方电网数字电网研究院有限公司 | Power grid fault type identification method and intelligent agent construction method |
CN116827685B (en) * | 2023-08-28 | 2023-11-14 | 成都乐超人科技有限公司 | Dynamic defense strategy method of micro-service system based on deep reinforcement learning |
CN117474295B (en) * | 2023-12-26 | 2024-04-26 | 长春工业大学 | Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method |
CN118761016A (en) * | 2024-09-06 | 2024-10-11 | 国网天津市电力公司培训中心 | Cable status monitoring method, system and storage medium based on deep reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11868882B2 (en) * | 2017-06-28 | 2024-01-09 | Deepmind Technologies Limited | Training action selection neural networks using apprenticeship |
2020-09-16: CN application CN202010974175.8A, patent CN112149347B (en), status: Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102130503A (en) * | 2011-02-28 | 2011-07-20 | 中国电力科学研究院 | A self-healing control method for distribution network based on multi-agent system |
CN109598386A (en) * | 2018-12-12 | 2019-04-09 | 国网山东省电力公司临沂供电公司 | A kind of accurate analysis method of distribution optimization and system based on deep learning |
CN110086153A (en) * | 2019-04-15 | 2019-08-02 | 东南大学 | A kind of active power distribution network failure afterload based on intelligent granule colony optimization algorithm turns for method |
CN110705873A (en) * | 2019-09-30 | 2020-01-17 | 国网福建省电力有限公司 | Novel power distribution network operation state portrait analysis method |
CN111401769A (en) * | 2020-03-25 | 2020-07-10 | 国网江苏省电力有限公司扬州供电分公司 | Intelligent power distribution network fault first-aid repair method and device based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
Research on load transfer control strategies for urban high-voltage distribution networks; Xiang Meiling; China Master's Theses Full-text Database; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN112149347A (en) | 2020-12-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112149347B (en) | Power distribution network load transfer method based on deep reinforcement learning | |
Qu et al. | Genetic optimization method of pantograph and catenary comprehensive monitor status prediction model based on adadelta deep neural network | |
CN115219906B (en) | Multi-model fusion battery state of charge prediction method and system based on GA-PSO optimization | |
CN115239072A (en) | Load transfer method and device based on graph convolution neural network and reinforcement learning | |
Xie et al. | A novel deep belief network and extreme learning machine based performance degradation prediction method for proton exchange membrane fuel cell | |
Gallego et al. | Maintaining flexibility in smart grid consumption through deep learning and deep reinforcement learning | |
CN105184415B (en) | A kind of Distribution Networks Reconfiguration design method | |
CN110059867A (en) | A kind of wind speed forecasting method of SWLSTM combination GPR | |
CN112101651B (en) | Electric energy network coordination control method, system and information data processing terminal | |
Yaniv et al. | Towards adoption of GNNs for power flow applications in distribution systems | |
CN119358998B (en) | Unmanned intelligent patrol equipment cooperative scheduling method and system in photovoltaic power generation scene | |
CN114818483A (en) | Electromechanical disturbance positioning and propagation prediction method based on graph neural network | |
CN112036651A (en) | Electricity price prediction method based on quantum immune optimization BP neural network algorithm | |
CN115940294A (en) | Multi-level power grid real-time dispatching strategy adjustment method, system, equipment and storage medium | |
CN118432048A (en) | A distribution network fault recovery method and system based on reinforcement learning | |
Roy et al. | Contingency analysis in power system | |
Bi et al. | Quantum annealing algorithm for fault section location in distribution networks | |
Lima et al. | Evaluation of recurrent neural networks for hard disk drives failure prediction | |
Zhao et al. | Performance decay prediction model of proton exchange membrane fuel cell based on particle swarm optimization and gate recurrent unit | |
Qi | Operation ControlMethod of Relay Protection in Flexible DC Distribution Network Compatible with Distributed Power Supply. | |
CN116401572A (en) | A method and system for fault diagnosis of transmission lines based on CNN-LSTM | |
CN111105025A (en) | Congestion management method for urban high-voltage distribution network based on data-driven heuristic optimization | |
CN114298429A (en) | Power distribution network scheme aided decision-making method, system, device and storage medium | |
CN110649597A (en) | An automatic control method of distribution network feeder based on RBF neural network | |
Zhang et al. | Electricity price forecasting method based on quantum immune optimization bp neural network algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||