
CN116454926A - Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network - Google Patents


Info

Publication number
CN116454926A
CN116454926A (application CN202310696501.7A)
Authority
CN
China
Prior art keywords
agent
markov
value
model
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310696501.7A
Other languages
Chinese (zh)
Other versions
CN116454926B (en)
Inventor
李佳勇
海征
陈大波
张聪
朱利鹏
帅智康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202310696501.7A
Publication of CN116454926A
Application granted
Publication of CN116454926B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00: Circuit arrangements for ac mains or ac distribution networks
    • H02J3/26: Arrangements for eliminating or reducing asymmetry in polyphase networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • G06Q10/0631: Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312: Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06: Energy or water supply
    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00: Circuit arrangements for ac mains or ac distribution networks
    • H02J3/12: Circuit arrangements for ac mains or ac distribution networks for adjusting voltage in ac networks by changing a characteristic of the network load
    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00: Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38: Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/381: Dispersed generators
    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00: Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20: Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00: Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/20: The dispersed energy generation being of renewable origin
    • H02J2300/22: The renewable source being solar energy
    • H02J2300/24: The renewable source being solar energy of photovoltaic origin
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E: REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00: Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/50: Arrangements for eliminating or reducing asymmetry in polyphase networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Power Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Biophysics (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Game Theory and Decision Science (AREA)
  • Primary Health Care (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

A multi-type resource collaborative regulation and control method for three-phase imbalance treatment of a distribution network belongs to the technical field of three-phase imbalance treatment of distribution networks and comprises the steps of: S1, setting a five-tuple as the basis for model construction; S2, constructing a Markov decision model and solving it to obtain a control strategy for the parallel capacitor bank and the phase change switch; S3, constructing a Markov game model and solving it so that each selected agent selectively attends to the information of the non-selected agents in the Q-value estimation model; S4, adopting two-step collaborative training to construct the upper-layer agent of the Markov decision model and the lower-layer agents of the Markov game model. The control method solves the problems that existing physical-model-based control techniques depend excessively on refined modeling and are difficult to apply to online treatment of three-phase unbalance in partially observable distribution networks, and it significantly improves the current unbalance compensation and voltage unbalance treatment effects of the distribution network.

Description

Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network
Technical Field
The invention belongs to the technical field of three-phase unbalance management of power distribution networks and relates to a multi-type resource cooperative regulation and control method for three-phase unbalance management of a distribution network.
Background
At present, the integration of a high proportion of distributed new energy into the power distribution network has become an important direction of energy development in China. However, the rapid increase in new-energy penetration not only causes frequent voltage fluctuations in the distribution network but also produces three-phase unbalanced currents and voltages. In addition, the injection power of a large number of dispersedly connected single-phase distributed photovoltaic sources aggravates the three-phase unbalance of the distribution network and seriously jeopardizes its safe and reliable operation.
In the prior art, real-time voltage control of the distribution network focuses only on the fast reactive-power adjustment capability of the photovoltaic inverter, ignoring the inherently asymmetric nature of the distribution network and the distributed synergy among the parallel capacitor bank, the phase change switch and the photovoltaic inverter. As a result, the adjustment capability of each type of controllable equipment cannot be fully invoked to mitigate the three-phase unbalance of voltage and current in the distribution network and improve power quality.
Disclosure of Invention
To achieve this purpose, the invention provides a multi-type resource collaborative regulation and control method for three-phase unbalance management of a power distribution network. The method solves the problems that existing physical-model-based control techniques depend excessively on refined modeling and are difficult to apply to online treatment of three-phase unbalance in partially observable distribution networks, and it remarkably improves the current unbalance compensation and voltage unbalance treatment effects of the distribution network.
The technical scheme adopted by the invention is as follows:
the first aspect of the embodiment of the invention provides a multi-type resource collaborative regulation and control method for three-phase imbalance treatment of a distribution network, which comprises the following steps: s1 setting five-tuple setAs construction model coordinates; s2, constructing a Markov decision model, and solving the Markov decision model by adopting a first calculation method to obtain a control strategy of the parallel capacitor bank and the phase change switch; s3, constructing a Markov game model, and solving the Markov game model by adopting a second calculation method to enable the selected agent to selectively pay attention to information of non-selected agents in the Q value estimation model; s4, constructing an upper-layer agent of a Markov decision model and a lower-layer agent of a Markov game model by adopting two-step collaborative training.
The first calculation method adopts a deep Q-network (DQN), which fits the action-value function with a deep neural network, to obtain the optimal control strategy of the parallel capacitor bank and the phase change switch;
the second calculation method adopts a multi-attention actor-critic (MAAC) method, which introduces an attention mechanism, to solve the Markov game, so that each selected agent selectively attends to the relevant information of the non-selected agents during Q-value estimation, reducing the computational complexity and storage space;
the two-step method adopts a multi-time-scale control method to cooperatively train the upper-layer agent of the Markov decision model and the lower-layer agents of the Markov game model, so that the parallel capacitor banks, phase change switches and photovoltaic inverters act cooperatively.
Further, constructing the Markov decision model from the five-tuple comprises: a state space S, an action space A, a reward function R and a state transition probability function P. S2.1, setting the state space S to contain the active power and reactive power of all nodes of the power distribution network, the active power of the photovoltaic devices and the node voltage amplitudes; S2.2, setting the action space A to contain the action instructions of the parallel capacitor banks and the phase change switches; S2.3, setting the reward function R to contain the sum of the zero-sequence and negative-sequence current components flowing through the transmission-distribution connection node, the voltage out-of-limit penalty value and the voltage unbalance out-of-limit penalty value; S2.4, setting the state transition probability function P. The upper-layer agent characterized by the state space S, the action space A and the reward function R is used to maximize the cumulative discounted reward.
Further, solving the Markov decision model comprises: S2.5, fitting the action-value function with a deep neural network; given a state s, taking an action a and then continuously interacting with the environment according to a policy π to obtain the expected reward, the action-value function can be defined as the Q function:

Q_π(s, a; φ) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ]

where E_π denotes the expected value under policy π, γ is the discount factor, φ are the weight parameters of the Q network to be optimized, t denotes time t, and r_t is the reward function value at time t;
the Markov decision model is then solved: according to the predicted Q values, the agent selects the action with the largest Q value, which takes effect at the preset next moment.
Further, solving the Markov decision model further comprises: S2.6, applying a target Q network and an experience replay mechanism; S2.7, updating the parameters φ of the loss function with the Adam optimizer, wherein all evaluation networks can be iteratively updated by minimizing a joint regression loss function, the loss function being:

L(φ) = E[ ( y_t − Q(s_t, a_t; φ) )² ],  y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; φ')

where E denotes the expected value, y_t is the target Q value, r_t is the reward function value, γ is the discount factor, φ' are the weight parameters of the target Q network, and Q(s_t, a_t; φ) is the predicted Q value;
S2.8, using an ε-greedy policy to select the actions of the Q network.
Further, the Markov game model comprises: S3.1, setting the state space S_i; the state space S_i contains the active power and reactive power of all nodes in distribution network area i, the active power and reactive power of the photovoltaic devices, the node voltage amplitudes, and the status information of the parallel capacitor banks and phase change switches at time t; S3.2, setting the action space A_i as the reactive power output values of the photovoltaic inverters in the area; S3.3, setting the reward function R_i as the reward function shared by the lower-layer multi-agents. At every time interval Δt, each agent obtains a corresponding action strategy according to the local state information of its area, power flow calculation of the three-phase asymmetric power distribution network is then carried out to obtain measurement information such as the voltage amplitudes of all nodes, and finally the reward function value at the current moment is calculated on this basis and the three-phase asymmetric power distribution network transitions to the next moment.
Further, solving the Markov game model includes: S3.4, considering the local observation state information and action information of the agent itself as well as the contribution degree of the local information of the other agents; S3.5, iteratively updating all evaluation networks by minimizing the joint regression loss function based on three trainable parameter-sharing matrices; S3.6, each agent updating the parameters of its own action network based on the policy gradient; and S3.7, updating the target network parameters so that each agent selectively attends to the relevant information of the other agents during Q-value estimation.
The beneficial effects of the invention are as follows: S1, a five-tuple is set as the basis for model construction; S2, a Markov decision model is constructed and solved with the first calculation method to obtain the control strategy of the parallel capacitor bank and the phase change switch; S3, a Markov game model is constructed and solved with the second calculation method so that each selected agent selectively attends to the information of the non-selected agents in the Q-value estimation model; S4, two-step collaborative training is adopted to construct the upper-layer agent of the Markov decision model and the lower-layer agents of the Markov game model. The method solves the problems that existing physical-model-based control techniques depend excessively on refined modeling and are difficult to apply to online treatment of three-phase unbalance in partially observable distribution networks, and it significantly improves the current unbalance compensation and voltage unbalance treatment effects of the distribution network.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for collaborative regulation of multiple types of resources for three-phase imbalance treatment of a distribution network according to an embodiment of the present invention;
FIG. 2 is a diagram of a multi-type resource collaborative regulation framework for three-phase imbalance management of a power distribution network according to an embodiment of the present invention;
FIG. 3 is a diagram of a DQN method network architecture according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the attention mechanism of the agent Q function according to an embodiment of the present invention;
FIG. 5 is a flow chart of a top level agent training provided in an embodiment of the present invention;
FIG. 6 is a flow chart of an underlying multi-agent training process according to one embodiment of the present invention;
FIG. 7 is a flowchart of an implementation strategy of a multi-time scale control method according to an embodiment of the present invention;
FIG. 8a is a plot of amplitude versus frequency of phase voltage at node a according to one embodiment of the present invention;
FIG. 8b is a plot of amplitude versus frequency for another phase of voltage at node a according to one embodiment of the present invention;
FIG. 9a is a plot of node b phase voltage magnitude versus frequency for an embodiment of the present invention;
FIG. 9b is a plot of amplitude versus frequency for another phase of voltage at node b according to one embodiment of the present invention;
FIG. 10a is a plot of amplitude versus frequency of phase voltage at node c according to one embodiment of the present invention;
FIG. 10b is a plot of amplitude versus frequency for another phase of voltage at node c according to one embodiment of the present invention;
FIG. 11a is a plot of voltage imbalance frequency provided by an embodiment of the present invention;
FIG. 11b is a plot of another voltage imbalance frequency provided by an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of a method for collaborative regulation of multiple types of resources for three-phase imbalance treatment of a distribution network according to an embodiment of the present invention. The first aspect of the embodiment of the invention provides a multi-type resource collaborative regulation and control method for three-phase imbalance treatment of a distribution network, which comprises the following steps: S1, setting a five-tuple (state space, action space, reward function, state transition probability function and discount factor) as the basis for model construction; S2, constructing a Markov decision model and solving it with a first calculation method to obtain a control strategy for the parallel capacitor bank and the phase change switch; S3, constructing a Markov game model and solving it with a second calculation method so that each selected agent selectively attends to the information of the non-selected agents in the Q-value estimation model; S4, adopting two-step collaborative training to construct the upper-layer agent of the Markov decision model and the lower-layer agents of the Markov game model.
The first calculation method adopts a deep Q-network (DQN), which fits the action-value function with a deep neural network, to obtain the optimal control strategy of the parallel capacitor bank and the phase change switch;
the second calculation method adopts a multi-attention actor-critic (MAAC) method, which introduces an attention mechanism, to solve the Markov game, so that each selected agent selectively attends to the relevant information of the non-selected agents during Q-value estimation, reducing the computational complexity and storage space;
the two-step method adopts a multi-time-scale control method to cooperatively train the upper-layer agent of the Markov decision model and the lower-layer agents of the Markov game model, so that the parallel capacitor banks, phase change switches and photovoltaic inverters act cooperatively.
In the present embodiment, the action-value function Q(s, a; φ) is characterized by a deep neural network, referred to as the Q network. The output of this network is a real number, called the Q value, which represents the long-term cumulative reward that the agent can obtain by taking a certain action in a certain state.
In this embodiment, the five-tuple is composed of the state space, action space, reward function, state transition probability function and discount factor of the upper-layer agent.
Referring to fig. 2, fig. 2 is a schematic diagram of the multi-type resource collaborative regulation framework for three-phase imbalance treatment of a power distribution network according to an embodiment of the present invention.
In this embodiment, step S2, constructing the Markov decision model and solving it with the first calculation method, is the long-time-scale control sub-step; control on the long time scale may comprise the following steps:
specifically, according to the discreteness of the parallel capacitor bank and the phase change switch action mode, the control problem is modeled as a Markov decision process, and is described by a five-tuple:
illustratively, the five-tuple includes: state spaceRepresenting a set of upper level agent state spaces. At preset time or experimental result time t, the state space of the upper intelligent agent is formed by the active power and the reactive power of all nodes of the power distribution network,Active power, node voltage amplitude, etc. of the photovoltaic device, and is defined as +.>
Illustratively, an action space A represents the set of upper-layer agent actions. At time t, the action of the upper-layer agent consists of the action instructions of the parallel capacitor banks and the phase change switches at that moment, and can be defined as the action vector a_t.
Since each parallel capacitor bank has the two actions of switching on and switching off, the size of this binary action set grows with the number of parallel capacitor banks N_CB and can be expressed as 2^N_CB. Similarly, each phase change switch has three different actions, conducting phase A, phase B or phase C, so the size of its action set can be expressed as 3^N_PS, where N_PS is the number of phase change switches.
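As an illustrative sketch (with n_cb and n_ps as assumed names for the numbers of parallel capacitor banks and phase change switches, not taken from the patent), the joint discrete action set handled by the upper-layer agent can be enumerated as follows:

```python
from itertools import product

def joint_action_space(n_cb: int, n_ps: int):
    """Enumerate all joint actions: 2**n_cb capacitor states x 3**n_ps switch states."""
    cb_states = list(product((0, 1), repeat=n_cb))           # each bank: off / on
    ps_states = list(product(("A", "B", "C"), repeat=n_ps))  # each switch: conducted phase
    return [(cb, ps) for cb, ps in product(cb_states, ps_states)]

# Example: 2 capacitor banks and 1 phase change switch -> 2**2 * 3**1 = 12 joint actions
actions = joint_action_space(2, 1)
assert len(actions) == 12
```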
Illustratively, a reward function R represents the reward function of the upper-layer agent. At time t, the reward of the upper-layer agent comprises three parts: the sum of the zero-sequence and negative-sequence current components flowing through the transmission-distribution connection node, the voltage out-of-limit penalty value and the voltage unbalance out-of-limit penalty value at that moment. To make the reward function value tend to its maximum, it can be calculated as r_t = −( I_t^(0) + I_t^(2) + f_V,t + f_U,t ).
It should be noted that I_t^(0) and I_t^(2) denote the amplitudes of the zero-sequence and negative-sequence current components through the transmission-distribution connection node at time t, as expressed in formulas (1) and (2); f_V,t and f_U,t denote the penalty term for node voltage out-of-limit and the penalty term for voltage unbalance violation, as shown in formulas (3) and (4):

I_t^(0) = (1/3) | İ_a,t + İ_b,t + İ_c,t |     (1)

I_t^(2) = (1/3) | İ_a,t + α² İ_b,t + α İ_c,t |,  α = e^(j2π/3)     (2)

f_V,t = Σ_j Σ_φ [ max( V_j,φ,t − V_max, 0 ) + max( V_min − V_j,φ,t, 0 ) ]     (3)

f_U,t = Σ_j max( ε_j,t − ε_max, 0 )     (4)

where, in formulas (1) and (2), İ_φ,t (φ = a, b, c) is the phase-φ current phasor through the transmission-distribution connection node, obtained from the phase voltage of amplitude V_φ,t and from P_φ,t and Q_φ,t, the active and reactive power of phase φ through the transmission-distribution connection node.

In formula (4), ε_j,t is the voltage unbalance degree of three-phase node j, which can be calculated from formula (5), and the summation runs over the set of all three-phase nodes of the three-phase asymmetric power distribution network:

ε_j = ( V_j^(2) / V_j^(1) ) × 100%     (5)

where, in formula (5), V_j^(1) and V_j^(2) are the positive-sequence and negative-sequence components obtained from the phase voltages of node j.
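The sequence quantities used in formulas (1), (2) and (5) can be sketched numerically as follows; this is an illustrative helper that assumes complex per-phase voltage phasors and per-phase active/reactive power at the connection node are available, and it is not taken from the patent itself:

```python
import numpy as np

ALPHA = np.exp(2j * np.pi / 3)  # 120-degree rotation operator

def sequence_currents(p_abc, q_abc, v_abc):
    """Zero- and negative-sequence current magnitudes at the interface node.

    p_abc, q_abc: per-phase active/reactive power; v_abc: per-phase complex voltage phasors.
    """
    i_abc = np.conj((np.asarray(p_abc) + 1j * np.asarray(q_abc)) / np.asarray(v_abc))
    i_zero = abs(i_abc.sum()) / 3.0                                       # formula (1)
    i_neg = abs(i_abc[0] + ALPHA**2 * i_abc[1] + ALPHA * i_abc[2]) / 3.0  # formula (2)
    return i_zero, i_neg

def voltage_unbalance(v_abc):
    """Voltage unbalance degree: negative- over positive-sequence magnitude, in %."""
    v_abc = np.asarray(v_abc)
    v_pos = abs(v_abc[0] + ALPHA * v_abc[1] + ALPHA**2 * v_abc[2]) / 3.0
    v_neg = abs(v_abc[0] + ALPHA**2 * v_abc[1] + ALPHA * v_abc[2]) / 3.0
    return 100.0 * v_neg / v_pos                                          # formula (5)
```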
Further, regarding the state transition probability function: since the state of the power distribution network at the next moment depends only on the state at the current moment and the action taken under the current policy, the state transition probability function P obeys a Markov decision process.
It should be noted that, based on the power flow calculation result, the three-phase unbalanced operation condition of the actual power distribution network is simulated, and in the training process of the whole model, the state transition relation satisfies the power flow constraint of the power distribution network.
Illustratively, γ is the discount factor used to balance the weights of immediate rewards and future rewards.
In this embodiment, at each time t the upper-layer agent obtains the corresponding action instruction a_t from the global observation state s_t of the three-phase asymmetric distribution network; power flow calculation of the distribution network is then carried out based on this action instruction to obtain the reward function value r_t at the current moment and the observation state s_{t+1} at the next moment. Cycling over this step, the objective of the upper-layer agent is to learn the optimal switching strategy of the regulating equipment through repeated interaction between the agent and the three-phase asymmetric distribution network environment, thereby maximizing the cumulative discounted reward Σ_t γ^t r_t.
Further, the Markov decision process is solved with the first calculation method, which adopts the DQN method, to obtain the optimal control strategy of the parallel capacitor bank and the phase change switch.
It should be noted that the DQN method uses a deep neural network to fit the action-value function Q(s, a; φ), where s and a respectively denote the state and action of the environment and φ are the weight parameters of the Q network to be optimized.
Referring to fig. 3, fig. 3 is a network architecture diagram of the DQN method according to an embodiment of the invention. As can be seen, the Q network consists of an input layer, two hidden layers and an output layer. Its input is the global state information s_t of the distribution network at the current time t, the number of input neurons being the number of elements in the state vector; its output contains the predicted Q values of all possible joint actions of the parallel capacitor banks and phase change switches in state s_t, with one output neuron per action combination.
Further, according to these predicted Q values, the agent selects the action with the largest Q value, which takes effect at the next moment.
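The Q-network structure just described can be sketched in PyTorch as follows; the hidden-layer width of 128 neurons is an assumed value and the class name is illustrative, not the patent's own code:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Input: global state s_t; output: one Q value per joint discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),      # hidden layer 2
            nn.Linear(hidden, n_actions),              # Q value for each action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def greedy_action(q_net: QNetwork, state: torch.Tensor) -> int:
    """Select the action with the largest predicted Q value."""
    with torch.no_grad():
        return int(q_net(state).argmax().item())
```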
Preferably, in order to improve the stability and convergence of the Q network during training, the DQN method introduces a target Q network and an experience replay mechanism, with the loss function shown in formula (6):

L(φ) = E[ ( y_t − Q(s_t, a_t; φ) )² ],  y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; φ')     (6)

where, in formula (6), E denotes the expected value, y_t is the target Q value, r_t is the reward function value, γ is the discount factor, φ' are the weight parameters of the target Q network, and Q(s_t, a_t; φ) is the predicted Q value.
The parameters φ of the loss function are updated with the Adam optimizer, and the update formula of the Q-network parameters can be obtained as:

φ_{t+1} = φ_t − η ∇_φ L(φ_t)     (7)

where, in formula (7), φ_t and φ_{t+1} are the Q-network parameters at time t and time t+1 respectively, and η is the learning rate. In order to ensure that the agent can actively explore the unknown environment while effectively exploiting the environment information, an ε-greedy policy is adopted to select the actions of the Q network, namely:

a_t = a random action from the action space, if ρ < ε;  a_t = argmax_a Q(s_t, a; φ), otherwise     (8)

where, in formula (8), ε is a constant and ρ is a randomly generated number in [0, 1]. When ρ < ε, the agent randomly selects one action from the action space; otherwise the agent selects the action with the largest Q value in the current state.
It should be noted that the ε-greedy strategy means that with high probability the agent selects the action with the largest Q value, while with the remaining small probability it selects a random action for exploration, so as to avoid falling into a locally optimal solution.
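A minimal sketch of the ε-greedy selection of formula (8) and one update step of the loss in formula (6) with the Adam optimizer is given below; the function names and batch layout are assumptions for illustration:

```python
import random
import torch
import torch.nn.functional as F

def epsilon_greedy(q_net, state, n_actions, eps):
    """Formula (8): explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def dqn_update(q_net, target_net, optimizer, batch, gamma):
    """One minimization step of the loss in formula (6); optimizer is torch.optim.Adam."""
    s, a, r, s_next = batch  # tensors sampled from the experience replay buffer
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values  # target Q value
    loss = F.mse_loss(q_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # parameter update, formula (7)
    return loss.item()
```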
Referring to fig. 5, for further explanation of the above principle, fig. 5 is a flowchart of upper-level agent training according to an embodiment of the present invention;
In this embodiment, the step of constructing the Markov game model and solving it with the second calculation method is the short-time-scale control sub-step, in which the cooperative control problem of the photovoltaic inverters is modeled as a partially observable Markov game. The model uses multiple agents to represent the optimization decisions and information interaction of different areas, and each agent is independently responsible for the action instructions of the photovoltaic inverters in its own sub-area. The Markov game is mainly composed of the following parts.
Specifically, a state space S_i represents the set of states of lower-layer agent i. The state s_{i,t,k} of agent i at the k-th interval Δt within time t contains the active power and reactive power of all nodes in distribution network area i, the active power and reactive power of the photovoltaic devices, the node voltage amplitudes, and the status information of the parallel capacitor banks and phase change switches within time t.
Specifically, an action space A_i represents the set of actions of lower-layer agent i. The action a_{i,t,k} of all photovoltaic inverters of agent i at the k-th interval Δt within time t can be expressed as the ratio of each inverter's reactive output to its maximum reactive output, from which the reactive output value of each photovoltaic inverter in area i is obtained.
Specifically, a reward function R_i represents the reward function shared by the lower-layer multi-agents. At the k-th interval Δt within time t, the reward value of agent i can be defined as:

r_{i,t,k} = −( I_{t,k}^(0) + I_{t,k}^(2) ) / I_ref − f_{V,i,t,k} − f_{U,i,t,k}     (9)

where, in formula (9), I_ref is a three-phase unbalance current reference value, and f_{V,i,t,k} and f_{U,i,t,k} are the penalty terms applied when a node voltage in area i exceeds its limit and when the voltage unbalance degree exceeds its limit, with the same form as formulas (3) and (4).

The state transition probability function of the lower-layer multi-agent model and the value of the discount factor γ are designed similarly to those of the upper-layer agent.
Further, in the lower-layer multi-agent architecture, at every interval Δt each agent obtains its corresponding action policy a_{i,t,k} from the local state information s_{i,t,k}; power flow calculation of the three-phase asymmetric distribution network is then carried out based on the forward-backward sweep method to obtain measurement information such as the voltage amplitudes of all nodes; finally, the reward function value at the current moment is calculated on this basis and the three-phase asymmetric distribution network transitions to the state at the next moment.
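A schematic sketch of one lower-layer control interval is shown below; run_power_flow and shared_reward are assumed placeholders standing in for the forward-backward sweep power-flow solver and the shared reward of formula (9), and are not the patent's own implementation:

```python
def lower_layer_step(agents, local_states, run_power_flow, shared_reward):
    """One short-time-scale interval of the lower-layer multi-agent control.

    agents:         one actor policy per distribution-network area
    local_states:   local observation vector s_i for each area
    run_power_flow: callable(actions) -> (node_measurements, next_local_states);
                    assumed to wrap a forward-backward sweep power-flow solver
    shared_reward:  callable(node_measurements, area_index) -> reward of formula (9)
    """
    # Each agent maps its local observation to PV inverter reactive-power ratios.
    actions = [agent.act(s_i) for agent, s_i in zip(agents, local_states)]
    node_measurements, next_local_states = run_power_flow(actions)
    rewards = [shared_reward(node_measurements, i) for i in range(len(agents))]
    return actions, rewards, next_local_states
```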
With reference to fig. 6, further explanation of the above principle is provided in fig. 6, which is a flowchart of the lower multi-agent training according to an embodiment of the present invention.
In this embodiment, the Markov game is solved with the second calculation method, which adopts the MAAC (multi-attention actor-critic) method, so that each agent selectively attends to the relevant information of the other agents during Q-value estimation, thereby greatly reducing the computational complexity and storage space.
Referring to fig. 4, fig. 4 is a schematic diagram of the attention mechanism of the agent Q function according to an embodiment of the present invention. The Q function Q_i(o, a) of agent i considers not only the agent's own local observation state information o_i and action information a_i, but also the contribution degree x_i of the local information of the other agents, as shown in formula (10):

Q_i(o, a) = f_i( g_i(o_i, a_i), x_i )     (10)

where, in formula (10), f_i denotes a two-layer MLP (multi-layer perceptron) and g_i denotes a single-layer MLP encoder, while the contribution x_i is expressed as a weighted sum of the encoded values of all agents except agent i, as shown in formula (11):

x_i = Σ_{j≠i} α_j v_j = Σ_{j≠i} α_j h( V g_j(o_j, a_j) )     (11)

where, in formula (11), V is the parameter-sharing matrix that converts the encoding of agent j into a 'value'; h is a nonlinear activation function; α_j is the attention weight allocated to agent j, obtained by a bilinear mapping between the 'query' of agent i and the 'key' of agent j, after which the similarity between the encoded values is passed through a softmax operation, as shown in formula (12):

α_j ∝ exp( e_j^T W_k^T W_q e_i ),  with e_i = g_i(o_i, a_i) and e_j = g_j(o_j, a_j)     (12)

where, in formula (12), W_q is the parameter-sharing matrix that converts the encoding of agent i into a 'query' and W_k is the parameter-sharing matrix that converts the encoding of agent j into a 'key'.
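A compact PyTorch sketch of the shared query/key/value attention of formulas (10)–(12) is given below; the module name, dimensions and activation choice are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAttention(nn.Module):
    """Shared W_q, W_k, V matrices used by every agent's critic."""
    def __init__(self, embed_dim: int, attend_dim: int):
        super().__init__()
        self.w_q = nn.Linear(embed_dim, attend_dim, bias=False)  # query matrix W_q
        self.w_k = nn.Linear(embed_dim, attend_dim, bias=False)  # key matrix W_k
        self.v = nn.Linear(embed_dim, attend_dim, bias=False)    # value matrix V

    def forward(self, e_i: torch.Tensor, e_others: torch.Tensor) -> torch.Tensor:
        """e_i: (embed_dim,) encoding of agent i; e_others: (n-1, embed_dim)."""
        scores = self.w_k(e_others) @ self.w_q(e_i)               # bilinear similarity
        alpha = F.softmax(scores, dim=0)                          # formula (12)
        values = F.leaky_relu(self.v(e_others))                   # h(V g_j(o_j, a_j))
        return (alpha.unsqueeze(1) * values).sum(dim=0)           # x_i, formula (11)
```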
Based on the three trainable parameter-sharing matrices W_q, W_k and V, all evaluation (critic) networks can be iteratively updated by minimizing the joint regression loss function, as shown in formula (13):

L_Q(ψ) = Σ_{i=1}^{N} E[ ( Q_i^ψ(o, a) − y_i )² ],  y_i = r_i + γ E[ Q_i^ψ'(o', a') − β log π_θ'(a_i' | o_i') ]     (13)

where, in formula (13), ψ and ψ' are the parameters of the evaluation networks and of the target evaluation networks, π_θ' denotes the target action network, and β is the temperature coefficient balancing reward maximization and policy entropy.

Each agent can then update the parameters θ_i of its own action network based on the policy gradient, as shown in formula (14):

∇_{θ_i} J(π_θ) = E[ ∇_{θ_i} log π_{θ_i}(a_i | o_i) ( − β log π_{θ_i}(a_i | o_i) + A_i(o, a) ) ]     (14)

where, in formula (14), the expectation also covers −i, the set of all agents except agent i, whose observations and actions enter Q_i(o, a), and A_i(o, a) is the multi-agent advantage function.
Finally, the target network parameters are updated based on formula (15):

ψ' ← τ ψ + (1 − τ) ψ',  θ' ← τ θ + (1 − τ) θ'     (15)

where, in formula (15), τ is the soft update coefficient.
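A one-function sketch of the soft target update of formula (15), written for PyTorch modules (illustrative, not the patent's own code):

```python
import torch

@torch.no_grad()
def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float):
    """Formula (15): let the target network slowly track the online network."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```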
In this embodiment, S4 adopts two-step collaborative training to construct an upper level agent of a markov decision model and a lower level agent of a markov game model.
Referring to fig. 7, fig. 7 is a flowchart of the execution strategy of the multi-time-scale control method according to an embodiment of the present invention. The upper- and lower-layer agents are cooperatively trained with the two-step method, each with its corresponding set of parameters to be optimized: φ denotes the weight parameter set to be optimized of the upper-layer agent, and ψ and θ denote the weight parameter sets to be optimized of the lower-layer multi-agents.
In one embodiment, constructing the Markov decision model from the five-tuple comprises: a state space S, an action space A, a reward function R and a state transition probability function P. S2.1, setting the state space S to contain the active power and reactive power of all nodes of the power distribution network, the active power of the photovoltaic devices and the node voltage amplitudes; S2.2, setting the action space A to contain the action instructions of the parallel capacitor banks and the phase change switches; S2.3, setting the reward function R to contain the sum of the zero-sequence and negative-sequence current components flowing through the transmission-distribution connection node, the voltage out-of-limit penalty value and the voltage unbalance out-of-limit penalty value; S2.4, setting the state transition probability function P. The upper-layer agent characterized by the state space S, the action space A and the reward function R is used to maximize the cumulative discounted reward.
In this embodiment, in order to verify the correctness and feasibility of the above method, the construction and training process of the proposed method is carried out in Python 3.9 with the PyTorch framework. Other frameworks capable of model construction and training may also be selected, depending on experimental suitability.
Further, solving the Markov decision model comprises: S2.5, fitting the action-value function with a deep neural network; given a state s, taking an action a and then continuously interacting with the environment according to a policy π to obtain the expected reward, the action-value function can be defined as the Q function:

Q_π(s, a; φ) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ]

where E_π denotes the expected value under policy π, γ is the discount factor, φ are the weight parameters of the Q network to be optimized, t denotes time t, and r_t is the reward function value at time t.
The Markov decision model is then solved: according to the predicted Q values, the agent selects the action with the largest Q value, which takes effect at the preset next moment.
In this embodiment, for the construction and training process of the proposed method carried out in Python 3.9 with the PyTorch framework, the parameters of the DQN method are set accordingly.
Further, solving the Markov decision model further comprises: S2.6, applying a target Q network and an experience replay mechanism; S2.7, updating the parameters φ of the loss function with the Adam optimizer, wherein all evaluation networks can be iteratively updated by minimizing a joint regression loss function, the loss function being:

L(φ) = E[ ( y_t − Q(s_t, a_t; φ) )² ],  y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; φ')

where E denotes the expected value, y_t is the target Q value, r_t is the reward function value, γ is the discount factor, φ' are the weight parameters of the target Q network, and Q(s_t, a_t; φ) is the predicted Q value.
S2.8, using an ε-greedy policy to select the actions of the Q network.
In this embodiment, the Q network outputs a real value according to the function Q_φ(s, a), which is typically a parameterized function, such as a neural network with parameters φ. The target Q network is used in DQN (Deep Q-Network), a Q-learning algorithm based on deep learning that mainly combines value-function approximation with neural network techniques and trains the network using a target network and experience replay.
In this embodiment, the experience replay mechanism is specifically as follows: experience replay is used to break the correlation between sample data. An experience replay buffer is constructed, and the experience data (s_t, a_t, r_t, s_{t+1}) obtained from the interaction between the agent and the three-phase asymmetric distribution network environment in a training round is stored into the buffer. When the buffer reaches its set capacity, on the one hand the agent starts to update its own network parameters, i.e. a certain number of experience samples are first randomly extracted from the replay buffer and the network parameters are then iteratively updated based on the extracted samples; on the other hand, the replay buffer automatically deletes the sample experience data generated by the earliest interactions with the three-phase asymmetric distribution network environment and stores the most recently collected sample experience data.
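A minimal sketch of such an experience replay buffer, assuming a fixed capacity and uniform random sampling (illustrative names, not the patent's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: the oldest transitions are dropped automatically."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # deque evicts the earliest entries

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        """Random minibatch; random sampling breaks the temporal correlation of the data."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```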
In this embodiment, the Adam optimizer (Adaptive Moment Estimation) provides fast gradient descent, although it tends to oscillate around the optimal value.
Further, the Markov game model comprises: S3.1, setting the state space S_i; the state space S_i contains the active power and reactive power of all nodes in distribution network area i, the active power and reactive power of the photovoltaic devices, the node voltage amplitudes, and the status information of the parallel capacitor banks and phase change switches at time t; S3.2, setting the action space A_i as the reactive power output values of the photovoltaic inverters in the area; S3.3, setting the reward function R_i as the reward function shared by the lower-layer multi-agents. At every time interval Δt, each agent obtains a corresponding action strategy according to the local state information of its area, power flow calculation of the three-phase asymmetric power distribution network is then carried out to obtain measurement information such as the voltage amplitudes of all nodes, and finally the reward function value at the current moment is calculated on this basis and the three-phase asymmetric power distribution network transitions to the next moment.
It should be noted that in multi-agent reinforcement learning, each agent has its own value function and autonomously learns and formulates its strategy based on environmental observations and interactions, with the goal of maximizing its own utility value.
Further, since each agent does not take into account the impact of its policy on the other agents when interacting with the environment, situations of competition or cooperation may arise under the mutual influence of the interacting agents. Multi-agent decisions can be analyzed specifically using game theory. For different multi-agent reinforcement learning scenarios, different game frameworks can be adopted to model the interaction, which can be divided into three categories as a whole.
For example, in a static game all agents make decisions at the same time and each agent takes only one action. Since each agent acts only once, it may adopt unexpected deceptive or betraying strategies to benefit itself from the game. Thus, in a static game, each agent needs to consider and guard against the deception and betrayal of the other agents when formulating its policy, so as to reduce its own losses.
For example, a repeated game is one in which multiple agents take repeated decision actions in the same state. The total value function of each agent is therefore the sum of its values over all decision actions. Compared with a static game, a repeated game largely avoids malicious action decisions among the agents, so that the sum of the total benefit values of all agents is improved as a whole.
By way of example, a stochastic game (or Markov game) can be regarded as a Markov process in which multiple agents make action decisions over multiple states. Based on its own state, each agent can make the optimal action decision for improving its own value function by observing the environment and predicting the actions of the other agents.
Further, solving the Markov game model includes: S3.4, considering the local observation state information and action information of the agent itself as well as the contribution degree of the local information of the other agents; S3.5, iteratively updating all evaluation networks by minimizing the joint regression loss function based on the three trainable parameter-sharing matrices; S3.6, each agent updating the parameters of its own action network based on the policy gradient; and S3.7, updating the target network parameters so that each agent selectively attends to the relevant information of the other agents during Q-value estimation.
In this embodiment, the parameters of the MAAC method used in the training process for solving the Markov game model are set accordingly.
in combination with the related principles of the above embodiments, the related data is further added to demonstrate the feasibility. Specifically, the three-phase unbalanced current represents the advantage of the proposed control method in the aspect of compensating the three-phase unbalanced current, and in order to introduce a compensation degree index to quantify the treatment effect of the current unbalance, the compensation degree of the negative sequence and zero sequence current components is respectively defined as follows:
(16)
(17)
wherein in the formulas (16) and (17), andThe compensation degree of positive sequence, negative sequence and zero sequence current components respectively;Andthe amplitudes of the current components of the negative sequence before and after reactive power compensation are respectively; andThe amplitudes of the zero sequence current components before and after reactive power compensation are respectively obtained. Then, respectively randomly extracting two typical days from sample data of a test set, and testing 960 groups of sample data to obtain the average zero sequence current, the average negative sequence current and the compensation degree value in the test set as follows:
furthermore, in the control method, the average negative sequence current component and the average zero sequence current component which pass through the transmission and distribution connection node in the test set are greatly reduced compared with the original value, the compensation degree is more than 55%, and the advantages of the method in the aspect of three-phase unbalanced current compensation are verified.
Specifically, in order to verify the advantages of the proposed control method in voltage control, a success-rate index is introduced to quantify the effect of node voltage amplitude control, and the voltage regulation success rate is defined as:

η_V = ( N_safe / N_limit ) × 100%     (18)

where, in formula (18), η_V is the success rate of voltage amplitude adjustment, N_limit is the number of voltage out-of-limit events before the regulation method is adopted, and N_safe is the number of those events whose voltage amplitude is brought back within the safe range after the regulation method is adopted. Then, based on the 960 groups of sample data, a voltage control verification test is performed, and the statistical result of the voltage amplitude adjustment success rate after adopting the method is obtained.
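A small sketch of the success-rate index of formula (18), with assumed argument names:

```python
def voltage_success_rate(n_safe: int, n_limit: int) -> float:
    """Formula (18): share of former out-of-limit events restored to the safe range, in %."""
    return 100.0 * n_safe / n_limit if n_limit else 100.0
```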
preferably, the experimental results of the maximum value, the minimum value and the maximum value of the voltage unbalance degree of the three-phase voltages of the nodes a, b and c in the test set data after the method is adopted are as follows:
referring to fig. 8a to 11b, the probability distribution diagrams of the phase voltage amplitudes of the nodes a, b and c and the three-phase voltage unbalance in the test set data except the node 0 before and after the proposed method are shown.
FIG. 8a is a graph showing a frequency distribution of the amplitude of the phase voltage at node a, wherein two 4000 frequency values exist in the range of 0.95p.u and 0.97 p.u; FIG. 8b is a plot of amplitude versus frequency for another phase voltage at node a, up to 6000 at 0.99 p.u. frequency, in accordance with one embodiment of the present invention; FIG. 9a is a plot of node b phase voltage magnitude versus frequency for a frequency of 4500 frequency values around 0.98p.u according to an embodiment of the present invention; FIG. 9b is a plot of amplitude versus frequency for another phase of voltage at node b for an embodiment of the present invention, the plot having frequency values above 4000 at 0.99p.u to 1 p.u; FIG. 10a is a plot of the amplitude versus frequency of the phase voltage at node c versus 3500 for a plot of 0.96p.u to 0.98p.u frequency; FIG. 10b is a plot of amplitude versus frequency for another phase voltage at node c for an embodiment of the present invention, with dense frequency values of ripple occurring at 0.98p.u to 1.01 p.u; FIG. 11a is a plot of voltage imbalance frequency distribution for an embodiment of the present invention, centered around 1400 at 0.5% imbalance to 1.5% imbalance frequency; FIG. 11b is a plot of another voltage imbalance frequency distribution graph showing frequency values at 0% imbalance and frequency values zero before 2% imbalance, and no more than 1500 frequency values, according to an embodiment of the present invention.
It should be noted that, by jointly scheduling different types of regulating equipment such as the parallel capacitor banks, phase change switches and photovoltaic inverters, the proposed control method not only avoids voltage out-of-limit events entirely (100%), but also keeps the voltage amplitudes of phases a, b and c relatively stable and close to the rated voltage. In addition, the method also makes the three-phase voltage amplitudes of all nodes similar to one another, i.e. it keeps the voltage unbalance degree within the safe range.

Claims (6)

1. A multi-type resource cooperative regulation method for three-phase unbalanced management of a distribution network is characterized in that,
S1, setting a five-tuple as the basis for model construction, comprising a state space S, an action space A, a reward function R and a state transition probability function P;
S2, constructing a Markov decision model, and solving the Markov decision model by adopting a first calculation method to obtain a control strategy of the parallel capacitor bank and the phase change switch;
S3, constructing a Markov game model, and solving the Markov game model by adopting a second calculation method to enable the selected agent to selectively pay attention to the information of the non-selected agents in the Q-value estimation model;
S4, constructing an upper-layer agent of the Markov decision model and lower-layer agents of the Markov game model by adopting two-step collaborative training;
the first calculation method adopts a deep Q-network (DQN), which fits the action-value function with a deep neural network, to obtain the optimal control strategy of the parallel capacitor bank and the phase change switch;
the second calculation method adopts a multi-attention actor-critic (MAAC) method, which introduces an attention mechanism, to solve the Markov game, so that the selected agent selectively pays attention to the relevant information of the non-selected agents in the Q-value estimation process, reducing the computational complexity and storage space;
the two-step method adopts a multi-time-scale control method to cooperatively train the upper-layer agent of the Markov decision model and the lower-layer agents of the Markov game model, so that the parallel capacitor bank, the phase change switch and the photovoltaic inverter act cooperatively.
2. The resource collaborative regulation method of claim 1, wherein S1 comprises:
S2.1, setting the state space S to contain the active power and reactive power of all nodes of the power distribution network, the active power of the photovoltaic devices and the node voltage amplitudes;
S2.2, setting the action space A to contain the action instructions of the parallel capacitor banks and the phase change switches;
S2.3, setting the reward function R to contain the sum of the zero-sequence and negative-sequence current components flowing through the transmission-distribution connection node, the voltage out-of-limit penalty value and the voltage unbalance out-of-limit penalty value; wherein I_t^(0) and I_t^(2) are respectively the zero-sequence and negative-sequence component amplitudes through the transmission-distribution connection node at time t, f_V,t and f_U,t are respectively the penalty term for node voltage out-of-limit and the penalty term for voltage unbalance violation, and t denotes time t;
S2.4, setting the state transition probability function P;
the upper-layer agent characterized by the state space S, the action space A and the reward function R is used to maximize the cumulative discounted reward.
3. The resource collaborative regulation and control method of claim 1, wherein the markov decision model solution includes;
S2.5, fitting the action-value function with a deep neural network; given a state s, taking an action a and then continuously interacting with the environment according to a policy π to obtain the expected reward defines the action-value function as the Q function:

Q_π(s, a; φ) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ];

where E_π denotes the expected value under policy π, γ is the discount factor, φ are the weight parameters of the Q network to be optimized, t denotes time t, and r_t is the reward function value at time t;
and solving the Markov decision model, and selecting the action with the maximum Q value by the intelligent agent according to the predicted Q value, and taking effect at the preset next moment.
4. A resource co-ordination method as claimed in claim 1 or claim 3, wherein the markov decision model solution further comprises;
S2.6, applying a target Q network and an experience replay mechanism;
S2.7, updating the parameters φ of the loss function with the Adam optimizer, wherein all evaluation networks can be iteratively updated by minimizing a joint regression loss function, the loss function being:

L(φ) = E[ ( y_t − Q(s_t, a_t; φ) )² ],  y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; φ');

where E denotes the expected value, y_t is the target Q value, r_t is the reward function value, γ is the discount factor, φ' are the weight parameters of the target Q network, and Q(s_t, a_t; φ) is the predicted Q value;
S2.8, using an ε-greedy policy to select the actions of the Q network.
5. The resource co-regulation method of claim 1, wherein the markov game model comprises;
s3.1 setting State spaceThe state space->Comprising distribution network areas->Active power and reactive power of all nodes in the photovoltaic system, active power and reactive power of the photovoltaic system, node voltage amplitude and time +.>Status information of the internal parallel capacitor bank and the phase change switch;
s3.2 setting an action spaceThe reactive power output value of each photovoltaic inverter in the area is calculated;
S3.3, setting the reward function, which is shared by the lower-layer multi-agents;
in the lower-layer multi-agent architecture of the Markov game model, at every control interval each agent obtains its action strategy according to the local state information of its own area; the power flow of the three-phase asymmetric power distribution network is then calculated to obtain measurement information such as the voltage amplitudes of all nodes; finally, the reward function value at the current moment is calculated on this basis, and the three-phase asymmetric power distribution network transitions to the next moment.
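A schematic Python loop for one control interval of the lower-layer Markov game described in claim 5 is sketched below; `agents` and `env` are hypothetical placeholders for the agent policies and the three-phase asymmetric power-flow environment, not interfaces defined by the patent.

    def lower_layer_step(agents, env, t):
        """One control interval of the lower-layer Markov game (claim 5)."""
        # 1. Each agent acts on the local state information of its own area.
        actions = {aid: agent.act(env.local_observation(aid, t))
                   for aid, agent in agents.items()}
        # 2. Run the three-phase asymmetric power flow with the new reactive setpoints.
        measurements = env.run_power_flow(actions, t)
        # 3. Compute the shared reward of S3.3 from the resulting measurements.
        reward = env.shared_reward(measurements)
        # 4. The distribution network transitions to the next moment.
        return actions, reward, t + 1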
6. The resource cooperative regulation and control method of claim 1, wherein solving the Markov game model comprises:
S3.4, evaluating the local observed state information and action information of each agent, as well as the contribution degree of each agent's local information;
S3.5, based on three trainable shared parameter matrices, iteratively updating all evaluation networks by minimizing a joint regression loss function;
S3.6, each agent updating the parameters of its own action network based on the policy gradient;
and S3.7, updating the target network parameters, so that each selected agent pays attention to the relevant information of the non-selected agents in the Q-value estimation process.
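Under the usual attention-critic formulation, the "three trainable parameter sharing matrices" of S3.5 can be read as query/key/value matrices, and the target-network update of S3.7 as a soft (Polyak) update; the sketch below illustrates that reading only, with the embedding dimension, module names and update coefficient chosen as assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedAttention(nn.Module):
        """Attention over the other agents' encoded local information, using
        three shared trainable matrices (query, key, value)."""
        def __init__(self, embed_dim: int = 64):
            super().__init__()
            self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
            self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
            self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)

        def forward(self, e_i: torch.Tensor, e_others: torch.Tensor) -> torch.Tensor:
            # e_i: (B, D) embedding of the selected agent's observation-action pair
            # e_others: (B, N-1, D) embeddings of the non-selected agents
            q = self.W_q(e_i).unsqueeze(1)                      # (B, 1, D)
            k = self.W_k(e_others)                              # (B, N-1, D)
            v = self.W_v(e_others)                              # (B, N-1, D)
            w = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
            return (w @ v).squeeze(1)                           # attended context (B, D)

    def soft_update(target_net: nn.Module, online_net: nn.Module, tau: float = 0.01):
        """Polyak-style soft update of the target network parameters (S3.7, assumed form)."""
        with torch.no_grad():
            for p_t, p in zip(target_net.parameters(), online_net.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)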
CN202310696501.7A 2023-06-13 2023-06-13 Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network Active CN116454926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310696501.7A CN116454926B (en) 2023-06-13 2023-06-13 Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network

Publications (2)

Publication Number Publication Date
CN116454926A true CN116454926A (en) 2023-07-18
CN116454926B CN116454926B (en) 2023-09-01

Family

ID=87132361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310696501.7A Active CN116454926B (en) 2023-06-13 2023-06-13 Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network

Country Status (1)

Country Link
CN (1) CN116454926B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019129729A1 (en) * 2017-12-31 2019-07-04 Vito Nv Unbalance compensation by optimally redistributing current
KR20210051043A (en) * 2019-10-29 2021-05-10 중앙대학교 산학협력단 Method and apparatus for optimizing home energy management system in three-phase unbalanced low-voltage distribution network
CN113489015A (en) * 2021-06-17 2021-10-08 清华大学 Power distribution network multi-time scale reactive voltage control method based on reinforcement learning
CN115117901A (en) * 2022-06-17 2022-09-27 佳源科技股份有限公司 Distribution area three-phase imbalance optimization method and system applying distributed photovoltaic access
CN115986750A (en) * 2022-12-30 2023-04-18 南京邮电大学 Voltage regulation method for layered multi-agent deep reinforcement learning power distribution network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DI CAO; JUNBO ZHAO: "Attention Enabled Multi-Agent DRL for Decentralized Volt-Var Control of Active Distribution System Using PV Inverters and SVCs", IEEE TRANSACTIONS ON SUSTAINABLE ENERGY, vol. 12, no. 3, pages 1582 - 1592, XP011862135, DOI: 10.1109/TSTE.2021.3057090 *
YOONGUN JUNG; CHANGHEE HAN: "Adaptive Volt-Var Control in Smart PV Inverter for Mitigating Voltage Unbalance at PCC Using Multiagent Deep Reinforcement Learning", APPLIED SCIENCE, vol. 11, no. 19, pages 1 - 14 *
ZHANG JIAN; CUI MINGJIAN; YAO XIAOYI: "Dual-Time-Scale Coordinated Optimization of Active Distribution Networks Based on Data-Driven and Physical Models", AUTOMATION OF ELECTRIC POWER SYSTEMS, pages 1 - 16 *
HUANG HUI; YU HONGQI; LIU PENGWEI: "Centralized Reactive Power and Voltage Control Strategy Considering Three-Phase Active Power Unbalance", YUNNAN ELECTRIC POWER TECHNOLOGY, vol. 48, no. 2, pages 31 - 36 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116823803A (en) * 2023-07-21 2023-09-29 深圳鑫舟生物信息科技有限公司 Biological compensation physiotherapy system
CN116823803B (en) * 2023-07-21 2024-01-30 深圳鑫舟生物信息科技有限公司 Biological compensation physiotherapy system
CN116961139A (en) * 2023-09-19 2023-10-27 南方电网数字电网研究院有限公司 Scheduling method and scheduling device for power system and electronic device
CN116961139B (en) * 2023-09-19 2024-03-19 南方电网数字电网研究院有限公司 Scheduling method and scheduling device for power system and electronic device
CN117477607A (en) * 2023-12-28 2024-01-30 国网江西综合能源服务有限公司 Three-phase imbalance treatment method and system for power distribution network with intelligent soft switch
CN117477607B (en) * 2023-12-28 2024-04-12 国网江西综合能源服务有限公司 Three-phase imbalance treatment method and system for power distribution network with intelligent soft switch
CN117806170A (en) * 2024-02-23 2024-04-02 中国科学院近代物理研究所 Microbeam focusing control method and device
CN117806170B (en) * 2024-02-23 2024-05-10 中国科学院近代物理研究所 Microbeam focusing control method and device

Also Published As

Publication number Publication date
CN116454926B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN116454926B (en) Multi-type resource cooperative regulation and control method for three-phase unbalanced management of distribution network
Yang et al. Reinforcement learning in sustainable energy and electric systems: A survey
CN110535146B (en) Electric power system reactive power optimization method based on depth determination strategy gradient reinforcement learning
CN114217524B (en) Power grid real-time self-adaptive decision-making method based on deep reinforcement learning
CN113937829B (en) Multi-target reactive power control method of active power distribution network based on D3QN
Zhang et al. Deep reinforcement learning for load shedding against short-term voltage instability in large power systems
CN115588998A (en) Graph reinforcement learning-based power distribution network voltage reactive power optimization method
CN115409650A (en) Power system voltage control method based on near-end strategy optimization algorithm
CN103618315B (en) A kind of line voltage idle work optimization method based on BART algorithm and super-absorbent wall
CN117973644A (en) Distributed photovoltaic power virtual acquisition method considering optimization of reference power station
Mu et al. Graph multi-agent reinforcement learning for inverter-based active voltage control
CN117039981A (en) Large-scale power grid optimal scheduling method, device and storage medium for new energy
CN115759370A (en) Mapping operation method based on MADDPG algorithm
Wei et al. Social cognitive optimization algorithm with reactive power optimization of power system
CN113344283B (en) Energy internet new energy consumption capability assessment method based on edge intelligence
CN117833263A (en) New energy power grid voltage control method and system based on DDPG
CN115276067B (en) Distributed energy storage voltage adjusting method suitable for dynamic topological change of power distribution network
CN117893043A (en) Hydropower station load distribution method based on DDPG algorithm and deep learning model
CN115133540B (en) Model-free real-time voltage control method for power distribution network
CN117117989A (en) Deep reinforcement learning solving method for unit combination
CN116896112A (en) Active power distribution network distributed power supply collaborative optimization operation method and system
CN115983373A (en) Near-end strategy optimization method based on graph convolution neural network
Ao et al. The application of DQN in thermal process control
Vakula et al. Evolutionary Prisoner's Dilemma in updating fuzzy linguistic model to damp power system oscillations
CN113837654B (en) Multi-objective-oriented smart grid hierarchical scheduling method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant