
CN115238592A - Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method - Google Patents


Info

Publication number
CN115238592A
CN115238592A
Authority
CN
China
Prior art keywords
state
action
strategy
function
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210967237.1A
Other languages
Chinese (zh)
Inventor
殷林飞
曹星辉
熊轶
胡立坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University filed Critical Guangxi University
Priority to CN202210967237.1A priority Critical patent/CN115238592A/en
Publication of CN115238592A publication Critical patent/CN115238592A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 — Computer-aided design [CAD]
    • G06F30/20 — Design optimisation, verification or simulation
    • G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F2113/00 — Details relating to the application field
    • G06F2113/04 — Power grid distribution networks
    • G06F2119/00 — Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02 — Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a power generation control method based on multi-time-scale meteorological prediction and distributed parallel trust strategy optimization, which combines multi-time-scale meteorological prediction with distributed parallel trust-strategy-optimization neural networks for the power generation control of a novel power system. First, the multi-time-scale meteorological prediction in the method processes meteorological data of different time scales and predicts future meteorological changes. Second, the distributed parallel trust strategy optimization in the method coordinates the power plants within a region and enables them to react quickly. The improved method can regulate and control novel power systems quickly and stably across different time scales under continuously changing weather; it realizes power generation control of the novel power system through meteorological prediction, improves regulation accuracy and increases regulation speed.

Description

Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method
Technical Field
The invention belongs to the field of power generation control of novel power systems, relates to artificial intelligence, quantum techniques and power generation control methods, and is suitable for the power generation control of novel power systems and integrated energy systems.
Background
Existing automatic generation control for novel power systems does not fully consider environmental factors, so the novel power system cannot accurately track environmental changes when adjusting its output.
In addition, conventional policy-optimization networks require large amounts of data for training; the high dimensionality of these data slows network training and easily causes the curse of dimensionality.
The multi-time-scale meteorological-prediction distributed parallel trust strategy optimization power generation control method proposed here addresses the inability of the novel power system to track the environment accurately during regulation, accelerates training of the novel power system and eliminates the curse of dimensionality.
Disclosure of Invention
The invention provides a multi-time-scale meteorological-prediction distributed parallel trust strategy optimization power generation control method, which combines multi-time-scale meteorological prediction, distributed parallelism and trust-strategy-optimization neural networks for the power generation control of a novel power system; in use, the method comprises the following steps:
step (1): define each controlled power generation area as an agent and mark the areas as {Agent_1, Agent_2, …, Agent_i}, where i is the index of the respective power generation area; the power generation areas do not interfere with one another yet remain interconnected, which gives the system high robustness;
step (2): initialize the parameters of the stacked self-coding neural network and the gated recurrent unit, collect three years of meteorological wind-intensity and illumination-intensity data, extract the meteorological feature data set and input it into the stacked self-coding neural network;
the stacked self-coding neural network is formed by stacking several self-coding (autoencoder) networks; x is the input meteorological feature vector, an n-dimensional vector with x ∈ R^n; the hidden layer h^(1) of the self-coding network AE_1 is taken as the input of the self-coding network AE_2; after AE_2 is trained, its hidden layer h^(2) is taken as the input of AE_3, and so on; stacking layer by layer reduces the feature dimension of the meteorological data, accelerates training of the gated recurrent unit and preserves the key information in the data; the hidden layers have different dimensions, so that:

$$
\begin{cases}
h^{(1)} = f\left(W^{(1)} x + b^{(1)}\right) \\
h^{(2)} = f\left(W^{(2)} h^{(1)} + b^{(2)}\right) \\
\quad\vdots \\
h^{(p)} = f\left(W^{(p)} h^{(p-1)} + b^{(p)}\right) \\
\hat{x} = \mathrm{softmax}\left(h^{(p)}\right)
\end{cases}
\tag{1}
$$

where h^(1), h^(2), h^(p-1) and h^(p) are the hidden layers of the self-coding networks AE_1, AE_2, AE_{p-1} and AE_p; W^(1), W^(2) and W^(p) are the parameter matrices of the hidden layers h^(1), h^(2) and h^(p); b^(1), b^(2) and b^(p) are the biases of AE_1, AE_2 and AE_p; f() is the activation function; p is the number of stacked self-coding layers; softmax() is the normalized exponential function used as the classifier;
in the stacked self-coding neural network, if h^(1) is m-dimensional and h^(2) is k-dimensional, stacking AE_1 onto AE_2 trains a network with the structure n → m → k: the network n → m → n is trained to obtain the transformation n → m, the network m → k → m is then trained to obtain the transformation m → k, and finally AE_1 and AE_2 are stacked to obtain the network n → m → k; by stacking AE_1 through AE_p layer by layer, the output vector x̂ is finally obtained through the softmax function;
after the stacked self-coding neural network is trained, initial network parameter values and the dimension-reduced meteorological features x̂ are obtained as the output;
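As an illustration of the greedy layer-wise training described above, the following minimal Python sketch compresses weather features n → m → k with two stacked autoencoders; it is not the patent's implementation, and the layer sizes (32 → 16 → 8), epoch count, learning rate and synthetic data are assumptions.

```python
# Minimal sketch of greedy layer-wise stacked-autoencoder training (illustrative only).
import torch
import torch.nn as nn

def train_autoencoder(data, in_dim, hid_dim, epochs=50, lr=1e-3):
    """Train one autoencoder (in_dim -> hid_dim -> in_dim) and return its encoder."""
    enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(hid_dim, in_dim), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon = dec(enc(data))
        loss = loss_fn(recon, data)   # reconstruction loss of this single layer
        loss.backward()
        opt.step()
    return enc

# Synthetic stand-in for the meteorological feature matrix (n = 32 assumed features).
x = torch.rand(1000, 32)

# Greedy layer-wise stacking: AE1 learns n -> m, AE2 learns m -> k.
enc1 = train_autoencoder(x, in_dim=32, hid_dim=16)   # n -> m
h1 = enc1(x).detach()
enc2 = train_autoencoder(h1, in_dim=16, hid_dim=8)   # m -> k
h2 = enc2(h1).detach()                               # reduced features fed to the GRU
print(h2.shape)                                      # torch.Size([1000, 8])
```

Each autoencoder here is trained on reconstruction loss alone and only the encoders are kept and chained, which mirrors the n → m → n, then m → k → m procedure in the text.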
step (3): let x̂_t denote the output vector of the self-coding neural network at time t; pre-training the stacked self-coding neural network yields the initial network parameter values and the meteorological features x̂_t, which are set as the inputs of the update gate and the reset gate in the gated recurrent unit; the outputs of the update gate and the reset gate are, respectively:

$$
\begin{cases}
z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)
\end{cases}
\tag{2}
$$

where z_t is the output of the update gate at time t, r_t is the output of the reset gate at time t, h_{t-1} is the hidden state of the gated recurrent unit at time (t-1), x_t is the input at time t, [·,·] denotes the concatenation of two vectors, W_z is the weight matrix of the update gate, W_r is the weight matrix of the reset gate, and σ() is the sigmoid function;
the gated recurrent unit discards and memorizes the input information through the two gates; the candidate hidden state h̃_t at time t is:

$$
\tilde{h}_t = \tanh\left(W_{\tilde{h}} \cdot [\,r_t * h_{t-1},\; x_t\,]\right)
\tag{3}
$$

where tanh() is the tanh activation function, W_{h̃} is the weight matrix of the candidate hidden state h̃_t, and * denotes the element-wise product;
after the update gate has produced the updated state information, the tanh activation creates a vector of candidate values from the input, which gives the candidate hidden state h̃_t; the network then computes the state h_t at time t as:

$$
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\tag{4}
$$

the reset gate determines how much of the past state should be remembered: when r_t is 0, the state information h_{t-1} at time (t-1) is forgotten and the candidate hidden state h̃_t is reset to the information input at time t; the update gate determines how much of the past state enters the new state: when z_t is 1, the candidate hidden state h̃_t becomes the state h_t at time t; by storing and filtering information with the update and reset gates, the gated recurrent unit retains the important features through the gate functions and learns to capture dependencies, thereby obtaining the optimal meteorological prediction;
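The gate equations (2)–(4) can be exercised with a short numpy sketch; the dimensions and random weights below are illustrative assumptions rather than trained parameters.

```python
# One gated-recurrent-unit step implementing equations (2)-(4) (illustrative sketch).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; x_t and h_prev are 1-D vectors, W_* act on [h_prev, x_t]."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                                  # update gate, eq. (2)
    r_t = sigmoid(W_r @ concat)                                  # reset gate,  eq. (2)
    cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))    # candidate state, eq. (3)
    h_t = (1.0 - z_t) * h_prev + z_t * cand                      # new hidden state, eq. (4)
    return h_t

rng = np.random.default_rng(0)
hidden, feat = 8, 8                            # hidden size and reduced feature size (assumed)
x_t = rng.normal(size=feat)
h_prev = np.zeros(hidden)
W_z = rng.normal(size=(hidden, hidden + feat))
W_r = rng.normal(size=(hidden, hidden + feat))
W_h = rng.normal(size=(hidden, hidden + feat))
print(gru_step(x_t, h_prev, W_z, W_r, W_h).shape)   # (8,)
```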
step (4): after the stacked self-coding neural network and the gated recurrent unit are trained, the meteorological data to be predicted are passed through the stacked self-coding neural network into the gated recurrent unit, and the resulting meteorological prediction is fed into the novel power system; in the novel power system, three parallel trust strategy optimization networks, one each for the short, medium and long time scales, are set up in every power generation area, where the short time scale is one day, the medium time scale is fifteen days and the long time scale is three months (a configuration sketch follows);
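A minimal configuration sketch for the three per-area networks; the dictionary layout and field names are assumptions for illustration only, not the patent's data structure.

```python
# Illustrative per-area configuration of the three time-scale networks (assumed layout).
HORIZONS = {
    "short":  {"span_days": 1,  "description": "intra-day regulation"},
    "medium": {"span_days": 15, "description": "mid-term scheduling"},
    "long":   {"span_days": 90, "description": "seasonal, about three months"},
}

def build_area_networks(area_id: int) -> dict:
    """Return one policy-optimization network handle per time scale for one area."""
    return {name: {"area": area_id, **cfg} for name, cfg in HORIZONS.items()}

print(build_area_networks(1)["long"]["span_days"])   # 90
```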
step (5): initialize the parameters of the parallel trust strategy optimization network in each area, set the parallel trust strategy optimization network policy, and initialize the parallel expectation-value table in the parallel trust strategy optimization network with an initial expectation value of 0;
step (6): set the number of iterations to X, set the initial search count to a positive integer V, and initialize the search count of each intrinsic action to V;
step (7): in the current state, the parallel trust strategy optimization network in each agent selects an action according to its policy, obtains the reward value corresponding to that action in the current environment, feeds the reward value back into the parallel expectation-value table, and then increments the iteration count by one; if the current iteration count equals X, the iterations are complete and the trained parallel trust strategy optimization network is obtained;
step (8): carry out policy optimization and parallel value optimization in the parallel trust strategy optimization network; the optimization proceeds as follows:
the core of the parallel trust strategy optimization network is the actor-critic method; in the policy optimization of the network, the Markov decision process is the tuple (S, A, P, r, ρ_0, γ), where S is the state space composed of wind intensity, illumination intensity, frequency deviation Δf, area control error ACE and the tie-line power-exchange assessment index CPS, with any state s ∈ S; A is the action space composed of power adjustments ΔP_{Gi} of different magnitudes, i = 1, 2, …, 2^j, where j is the dimension of the quantum superposition action |A⟩, with any action a ∈ A; P is the transition probability distribution matrix for moving from any state s to a state s′ through any action a; r() is the reward function; ρ_0 is the probability distribution of the initial state s_0; γ is the discount factor; let π denote the stochastic policy π: S × A → [0, 1]; the expected cumulative reward function η(π) under policy π is:

$$
\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t)\right]
\tag{5}
$$

where s_0 is the initial state, a_0 is the action selected by the stochastic policy π in state s_0, γ^t is the discount factor at time t, s_t is the state at time t, a_t is the action at time t, r(s_t) is the reward value in state s_t, and a_t ∼ π(·|s_t) denotes that the action a_t is sampled from policy π in state s_t;
a state-action value function Q^π(s_t, a_t), a state value function V^π(s_t), an advantage function A^π(s, a) and a probability distribution function ρ_π(s) are introduced;
the state-action value function Q^π(s_t, a_t) gives the cumulative reward obtained after executing action a_t in state s_t under policy π:

$$
Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l})\right]
\tag{6}
$$

where s_{t+1} is the state at time (t+1); s_{t+l} is the state at time (t+l); a_{t+1} is the action at time (t+1); a_{t+l} is the action at time (t+l); a_{t+l} ∼ π(·|s_{t+l}) denotes that the action a_{t+l} is sampled from policy π in state s_{t+l}; γ^l is the discount factor applied l steps after time t; l is a non-negative integer; and r(s_{t+l}) is the reward value in state s_{t+l};
the state value function V^π(s_t) gives the cumulative reward from state s_t under policy π; V^π(s_t) is the mean of Q^π(s_t, a_t) with respect to the action a_t:

$$
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q^{\pi}(s_t, a_t)\right]
\tag{7}
$$

where E_{a_t∼π}[·] denotes the mean of Q^π(s_t, a_t) over actions a_t drawn from policy π;
the advantage function A^π(s, a) gives the advantage of taking any action a in any state s compared with the average:

$$
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
\tag{8}
$$

where Q^π(s, a) is the state-action value function when s_t is any state s and a_t is any action a, and V^π(s) is the state value function when s_t is any state s;
the probability distribution function ρ_π(s) gives the discounted visitation distribution of any state s under policy π:

$$
\rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s)
\tag{9}
$$

where P(s_t = s) is the probability that the state s_t at time t is the state s;
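The quantities in equations (6)–(9) can be estimated from sampled trajectories; the following sketch computes Monte-Carlo estimates of Q, V and the advantage A = Q − V on synthetic data, purely as an illustration of the definitions (the discount factor and the trajectories are assumed).

```python
# Monte-Carlo estimates of Q(s,a), V(s) and A(s,a) from sampled trajectories (illustrative).
import numpy as np
from collections import defaultdict

GAMMA = 0.9  # assumed discount factor

def discounted_return(rewards, gamma=GAMMA):
    """Return sum_l gamma^l * r_{t+l} for the tail starting at each step t."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

def estimate_q_v_a(trajectories):
    """trajectories: list of [(s, a, r), ...]; returns Q(s,a), V(s), A(s,a) tables."""
    q_sum, q_cnt = defaultdict(float), defaultdict(int)
    v_sum, v_cnt = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        returns = discounted_return([r for (_, _, r) in traj])
        for (s, a, _), g in zip(traj, returns):
            q_sum[(s, a)] += g; q_cnt[(s, a)] += 1
            v_sum[s] += g;      v_cnt[s] += 1
    Q = {k: q_sum[k] / q_cnt[k] for k in q_sum}
    V = {s: v_sum[s] / v_cnt[s] for s in v_sum}
    A = {(s, a): Q[(s, a)] - V[s] for (s, a) in Q}   # advantage, eq. (8)
    return Q, V, A

# Two tiny synthetic trajectories over abstract states "s0", "s1" and actions 0/1.
trajs = [[("s0", 0, 1.0), ("s1", 1, 0.5)], [("s0", 1, 0.2), ("s1", 0, 1.0)]]
Q, V, A = estimate_q_v_a(trajs)
print(round(A[("s0", 0)], 3))
```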
the parallel value optimization of the parallel trust strategy optimization network optimizes the state-action value function Q^π(s_t, a_t) in parallel; the optimization of Q^π(s_t, a_t) introduces quantum bits and the Grover search method, which accelerates network training and eliminates the curse of dimensionality;
parallel value optimization quantizes the action a_t and, in state s_t, updates the search count V rather than updating the action probabilities of state s_t; the update proceeds as follows:
the action space contains 2^j intrinsic actions, and these 2^j intrinsic actions |a_t⟩ are represented by their superposition; the action a_t is quantized into the j-dimensional quantum superposition action |A⟩, each qubit of which is a superposition of the states |0⟩ and |1⟩; the quantum superposition action is equivalent to |A⟩, and the expression of the j-dimensional quantum superposition action |A⟩ is:

$$
|A\rangle = \sum_{a=0}^{2^{j}-1} C_a |a\rangle
\tag{10}
$$

where |a⟩ is a quantum action observed from the j-dimensional quantum superposition action |A⟩ and C_a is its probability amplitude; the moduli |C_a| of the quantum actions |a⟩ satisfy

$$
\sum_{a=0}^{2^{j}-1} |C_a|^{2} = 1;
$$

when the quantum superposition action |A⟩ is observed, |A⟩ collapses to a quantum action |a⟩, and every qubit of the quantum action |a⟩ is either |0⟩ or |1⟩; each qubit of the quantum action |a⟩ carries a value representing its expected value; these expected values differ from state to state and are used to select actions in a given state; they are updated as follows:
if, under policy π and in state s_t, the quantum superposition action |A⟩ collapses to a quantum action |a⟩, then when the cumulative reward η increases for action |a⟩, the expected value of every qubit that is |0⟩ decreases and the expected value of every qubit that is |1⟩ increases; the increments and decrements of the expected values are updated according to policy π; when an action is selected in the same state, every qubit whose expected value is positive is set to |1⟩ and the remaining qubits are set to |0⟩, which yields the quantum action; let the parameter vector of policy π be θ⃗ = (θ_0, θ_1, …, θ_u); the quantum action is normalized to a concrete action value and converted through the parameter vector θ⃗ into the power adjustment ΔP_{G1}, which Agent_1 outputs in this state; θ_0, θ_1, …, θ_u are the components of the parameter vector θ⃗ and u is a positive integer;
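One possible way to turn the collapsed qubits into a power adjustment, consistent with the selection rule above but with an assumed normalization and scaling (the p_max_mw value and the symmetric linear mapping are illustrative assumptions, not the patent's parameter-vector conversion):

```python
# Illustrative mapping from collapsed qubits to a power adjustment Delta P_G1 (assumed).
import numpy as np

def select_quantum_action(expected_values):
    """Set qubits with positive expected value to 1, the rest to 0 (per the text)."""
    return np.array([1 if e > 0 else 0 for e in expected_values], dtype=int)

def action_to_power_delta(bits, p_max_mw=50.0):
    """Normalize the j-bit action to [0, 1] and map it to a power change in MW (assumed scale)."""
    j = len(bits)
    index = int("".join(map(str, bits)), 2)      # observed intrinsic action index
    fraction = index / (2**j - 1)                # normalize to [0, 1]
    return (2.0 * fraction - 1.0) * p_max_mw     # symmetric range [-p_max, +p_max]

expected = [0.3, -0.1, 0.7, -0.4]                # per-qubit expected values (assumed)
bits = select_quantum_action(expected)           # -> [1, 0, 1, 0]
print(bits, round(action_to_power_delta(bits), 2), "MW")
```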
quantum action selection uses the Grover search method to obtain, for the state s_t at time t, the quantum action |a⟩ produced when the quantum superposition action |A⟩ is observed and collapses; the search count V is obtained and updated as follows:
first, all intrinsic actions are superposed with equal weight by applying j Hadamard gates in sequence to j independent qubits, each initialized to |0⟩, which gives:

$$
|A_0\rangle = H^{\otimes j} |0\rangle^{\otimes j} = \frac{1}{\sqrt{2^{j}}} \sum_{a=0}^{2^{j}-1} |a\rangle
\tag{11}
$$

where H is the Hadamard gate, which converts the ground state |0⟩ into the equal-weight superposition (|0⟩ + |1⟩)/√2; |A_0⟩ is the initialized quantum superposition action, formed by superposing the 2^j intrinsic actions with equal probability amplitude; H^{⊗j} denotes applying the j Hadamard gates in sequence to the 2^j initialized intrinsic actions;
the two parts of the Grover iteration operator are U_{|a⟩} and U_{|A_0⟩}, respectively:

$$
U_{|a\rangle} = I - 2|a\rangle\langle a|, \qquad U_{|A_0\rangle} = I - 2|A_0\rangle\langle A_0|
\tag{12}
$$

where I is the identity matrix of appropriate dimension; ⟨a| is the conjugate (bra) of the quantum action |a⟩, ⟨0| is the conjugate of |0⟩ and ⟨A_0| is the conjugate of the initialized superposition |A_0⟩; |a⟩⟨a| is the outer product of |a⟩ and |A_0⟩⟨A_0| is the outer product of |A_0⟩; U_{|a⟩} and U_{|A_0⟩} act as quantum black boxes: when U_{|a⟩} acts on an intrinsic action |a_t⟩ it shifts the phase of the component along |a⟩ by 180 degrees, and when U_{|A_0⟩} acts on |a_t⟩ it shifts the phase of the component along |A_0⟩ by 180 degrees;
denote the Grover iteration by U_Grov:

$$
U_{\mathrm{Grov}} = U_{|A_0\rangle}\, U_{|a\rangle}
\tag{13}
$$

each intrinsic action is iterated according to U_Grov, the iteration count that comes closest to the target with the fewest iterations is obtained, this count is recorded as the search count V corresponding to each intrinsic action, and the policy π is updated at the same time;
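The Grover iteration of equations (11)–(13) can be simulated classically for small j; the sketch below applies the two reflections of equation (12) and reports the iteration count that maximizes the probability of observing the marked intrinsic action (the marked index and j are arbitrary examples).

```python
# Classical simulation of the Grover iteration in equations (11)-(13) (illustrative).
import numpy as np

def grover_best_iterations(j, marked, max_iters=None):
    n = 2**j
    psi = np.full(n, 1.0 / np.sqrt(n))                 # |A0>: equal-weight superposition, eq. (11)
    a = np.zeros(n); a[marked] = 1.0                   # marked action |a>
    if max_iters is None:
        max_iters = int(np.ceil(np.pi / 4 * np.sqrt(n)))
    best_k, best_p = 0, psi[marked]**2
    for k in range(1, max_iters + 1):
        psi = psi - 2.0 * a * (a @ psi)                # U_|a>  = I - 2|a><a|,   eq. (12)
        u = np.full(n, 1.0 / np.sqrt(n))
        psi = psi - 2.0 * u * (u @ psi)                # U_|A0> = I - 2|A0><A0|, eq. (12)
        p = psi[marked]**2
        if p > best_p:
            best_k, best_p = k, p
    return best_k, best_p

k, p = grover_best_iterations(j=4, marked=5)
print(k, round(p, 4))   # near-optimal iteration count and its success probability
```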
from equation (9), when the state s_t at time t is any state s and the policy is updated from π to π̃, the expected cumulative reward function η(π̃) is:

$$
\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s)\, A^{\pi}(s, a)
\tag{14}
$$

where π̃ is the updated policy; P(s_t = s) is the probability, under the policy π̃, that the state s_t at time t is the state s; ρ_{π̃}(s) is the probability distribution (visitation distribution) of any state s under the updated policy π̃; a ∼ π̃(·|s) denotes that, under the policy π̃, the action a_t is sampled in state s_t, i.e. an action a is sampled in any state s; A^π(s, a) is the advantage function under policy π when the state s_t at time t is any state s and the action is any action a; A^π(s_t, a_t) is the advantage of taking action a_t in the state s_t at time t compared with the average; and η(π) is the expected cumulative reward function under policy π;
in any state s, if the expected advantage Σ_a π̃(a|s) A^π(s, a) is non-negative, then the action values selected by the updated policy π̃ in the state s_t either increase the cumulative reward η or, when the expected advantage is zero, leave it unchanged; the policy is therefore continuously updated to optimize the cumulative reward η;
since equation (14) requires computing the probability distribution ρ_{π̃} of the updated policy π̃, its complexity is high and it is difficult to optimize directly; a surrogate function L_π(π̃) is therefore introduced to reduce the computational complexity:

$$
L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a|s)\, A^{\pi}(s, a)
\tag{15}
$$

where argmax() denotes the operator, used in the updates below, that returns the argument maximizing a function, and ρ_π(s) is the probability distribution function of any state s under policy π;
the difference between the surrogate function L_π(π̃) and η(π̃) is that the surrogate function L_π(π̃) ignores the change in state-visitation density caused by the policy change: L_π(π̃) uses ρ_π as the visitation frequency instead of ρ_{π̃}; the visitation frequency ρ_π is obtained by approximating with the current policy π, and when π is close to π̃ and certain constraints are satisfied, the surrogate function L_π(π̃) can replace the original expected cumulative reward function η(π̃);
In the update of the parameter vector θ⃗, the policy π is parameterized by θ⃗, in the form of an arbitrary parameter θ, as π_θ, where π_θ(a|s) is any action a taken in any state s under the parameterized policy π_θ; for any parameter θ, when the policy is not updated, the surrogate function and the original cumulative reward function are exactly equal, i.e.:

$$
L_{\pi_{\theta_{\mathrm{old}}}}\left(\pi_{\theta_{\mathrm{old}}}\right) = \eta\left(\pi_{\theta_{\mathrm{old}}}\right)
\tag{16}
$$

where π_{θ_old} is the policy π parameterized by the parameter vector θ⃗_old;
the derivatives of the surrogate function and of the original cumulative reward function with respect to the arbitrary parameter θ coincide at the policy π_θ; therefore, if the policy is updated from π_{θ_old} to π_{θ̃} by a very small change and the surrogate value L_{π_{θ_old}}(π_{θ̃}) increases, the cumulative reward η also increases, so the policy can be improved by taking the surrogate function as the optimization objective, i.e.:

$$
\nabla_{\theta} L_{\pi_{\theta_{\mathrm{old}}}}\!\left(\pi_{\theta}\right)\Big|_{\theta=\theta_{\mathrm{old}}} = \nabla_{\theta}\, \eta\!\left(\pi_{\theta}\right)\Big|_{\theta=\theta_{\mathrm{old}}}
\tag{17}
$$

where ∇_θ denotes the derivative of a function with respect to the arbitrary parameter θ;
equations (16) and (17) show that updating the policy from π_{θ_old} to π_{θ̃} by a sufficiently small step increases the cumulative reward η; define π′ as the policy with the maximum cumulative reward value among the old policies and define the intermediate divergence variable α; to raise the lower bound of the cumulative reward η, the conservative iteration policy π_new(a|s) is set as:

$$
\pi_{\mathrm{new}}(a|s) = (1-\alpha)\, \pi_{\mathrm{old}}(a|s) + \alpha\, \pi'(a|s)
\tag{18}
$$

where π_new is the new policy; π_old is the current policy; α = D_TV^max(π_old, π_new) is the maximum total-variation divergence between π_new and π_old; π_old(·|s) and π_new(·|s) are the action distributions selected in any state s under π_old and π_new; D_TV(π_old(·|s) ∥ π_new(·|s)) is the total-variation divergence between π_old(·|s) and π_new(·|s); π′(a|s) is any action a selected in any state s under policy π′; and π_old(a|s) is any action a selected in any state s under policy π_old;
for any stochastic policy, let the intermediate variable ε = max_{s,a} |A^π(s, a)|, where max_{s,a}|·| takes the maximum absolute value over all states s and actions a; substituting π_new for π̃ and π_old for π, the surrogate value L_π and the cumulative reward η satisfy:

$$
\eta(\pi_{\mathrm{new}}) \ge L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2\varepsilon\gamma}{(1-\gamma)^{2}}\, \alpha^{2}
\tag{19}
$$
wherein γ is a discount factor;
the maximum relative entropy is D_KL^max(π, π̃) = max_s D_KL(π(·|s) ∥ π̃(·|s)), where π(·|s) is the action distribution selected in any state s under policy π, π̃(·|s) is the action distribution selected in any state s under policy π̃, and D_KL(π(·|s) ∥ π̃(·|s)) is the relative entropy between π(·|s) and π̃(·|s);
the total-variation divergence and the relative entropy satisfy

$$
D_{\mathrm{TV}}\big(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)\big)^{2} \le D_{\mathrm{KL}}\big(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)\big),
$$

where D_TV(π(·|s) ∥ π̃(·|s)) is the total-variation divergence between π(·|s) and π̃(·|s);
letting C = 4εγ/(1-γ)² and re-expressing the constraint in terms of the relative entropy gives:

$$
\eta(\tilde{\pi}) \ge L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi})
\tag{20}
$$

where C is the penalty coefficient;
under this constraint, the continuously updated policy sequence π_0 → π_1 → … → π_X satisfies η(π_0) ≤ η(π_1) ≤ … ≤ η(π_X), where → denotes a policy update; π_0, π_1, …, π_X is the policy sequence of the parallel trust strategy optimization network, and η(π_0), η(π_1), …, η(π_X) are the cumulative rewards of the policies in that sequence;
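The monotone improvement of this sequence follows from the standard minorization-maximization argument applied to the bound in equation (20); a brief derivation (added here for clarity, not part of the original text):

```latex
% Why the policy sequence is monotonically improving (standard trust-region argument).
\begin{aligned}
M_i(\pi) &:= L_{\pi_i}(\pi) - C\,D_{\mathrm{KL}}^{\max}(\pi_i, \pi), \\
\eta(\pi_{i+1}) &\ge M_i(\pi_{i+1}) && \text{by the lower bound (20)}, \\
\eta(\pi_i) &= M_i(\pi_i) && \text{since } L_{\pi_i}(\pi_i)=\eta(\pi_i) \text{ and } D_{\mathrm{KL}}^{\max}(\pi_i,\pi_i)=0, \\
\Rightarrow\ \eta(\pi_{i+1}) - \eta(\pi_i) &\ge M_i(\pi_{i+1}) - M_i(\pi_i) \ge 0 && \text{when } \pi_{i+1} \text{ maximizes } M_i.
\end{aligned}
```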
considering the parameterized policy π_θ and the parameter vector θ⃗, the terms unrelated to the parameter vector θ⃗ are pruned;
the expected cumulative reward function after the change of parameter variables is:

$$
\eta(\tilde{\theta}) := \eta\big(\pi_{\tilde{\theta}}\big)
\tag{21}
$$

the surrogate function after the change of parameter variables is:

$$
L_{\theta}(\tilde{\theta}) := L_{\pi_{\theta}}\big(\pi_{\tilde{\theta}}\big)
\tag{22}
$$

the relative entropy after the change of parameter variables is:

$$
D_{\mathrm{KL}}(\theta \,\|\, \tilde{\theta}) := D_{\mathrm{KL}}\big(\pi_{\theta} \,\|\, \pi_{\tilde{\theta}}\big)
\tag{23}
$$

the constraint after the change of parameter variables is:

$$
\tilde{\theta} = \arg\max_{\tilde{\theta}} \Big[ L_{\theta_{\mathrm{old}}}(\tilde{\theta}) - C\, D_{\mathrm{KL}}^{\max}\big(\theta_{\mathrm{old}}, \tilde{\theta}\big) \Big]
\tag{24}
$$

where := denotes taking the equivalent value after the change of variables; θ_old is the parameter vector that needs to be updated; θ̃ is the parameter vector after the update of θ⃗; π_{θ_old} is the policy π parameterized by the parameter vector θ⃗_old; π_{θ̃} is the policy π parameterized by the parameter vector θ̃; η(θ̃) is the expected cumulative reward function of the policy π_{θ̃}; L_{θ_old}(θ̃) is the surrogate function of the policy π_{θ̃}; D_KL(θ_old ∥ θ̃) is the relative entropy between π_{θ_old} and π_{θ̃}; and D_KL^max(θ_old, θ̃) is the maximum of the relative entropy after the change of parameter variables;
equations (21) to (24) give the update process of the parameter vector θ̃ of the parallel trust strategy optimization network; updating the parameter vector θ⃗ optimizes the selection weights of the actions and thereby achieves the goal of optimizing the parallel control;
to ensure that the cumulative reward η increases, L_{θ_old}(θ̃) − C·D_KL^max(θ_old, θ̃) is maximized; however, because C acts as a penalty coefficient, each D_KL^max term becomes small, which shortens every update step and reduces the update speed, so the penalty term is turned into a constraint term:

$$
\max_{\tilde{\theta}}\; L_{\theta_{\mathrm{old}}}(\tilde{\theta}) \quad \text{s.t.} \quad D_{\mathrm{KL}}^{\max}\big(\theta_{\mathrm{old}}, \tilde{\theta}\big) \le \delta
\tag{25}
$$

where δ is a constant;
equation (14) requires sampling according to the policy π_{θ̃}; because the policy π_{θ̃} is unknown before the update, it is impossible to sample from π_{θ̃}, so importance sampling is used to rewrite the parameterized cumulative reward function L_{θ_old}(θ̃); the terms unrelated to the parameter θ are ignored and the state-action value Q_{θ_old} is used in place of the advantage A_{θ_old}; the update of the parallel trust strategy optimization network finally becomes:

$$
\max_{\tilde{\theta}}\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_{\tilde{\theta}}(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}\, Q_{\theta_{\mathrm{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\!\Big[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot|s) \,\|\, \pi_{\tilde{\theta}}(\cdot|s)\big)\Big] \le \delta
\tag{26}
$$

where E_{s∼ρ_{θ_old}, a∼π_{θ_old}}[·] denotes sampling states from the probability distribution ρ_{θ_old} and actions from the parameterized policy π_{θ_old}, and Q_{θ_old}(s, a) is the state-action value function under the policy π_{θ_old} when the state s_t is any state s and the action a_t is any action a;
the parameter vector θ̃ is updated according to the set constraint, the updated parameter vector is used to update the policy π, which completes the policy update in the parallel strategy optimization network, and an action is then selected with the new policy in the current state, iterating step by step;
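A simplified sketch of the constrained update in equations (25)–(26) for a small tabular softmax policy; it replaces the natural-gradient step of a full trust-region solver with a crude backtracking line search, and all data (states, actions, Q estimates, visitation weights, ascent direction) are synthetic stand-ins, so it only illustrates the shape of the update, not the patent's implementation.

```python
# Constrained surrogate maximization with a KL limit (illustrative tabular sketch).
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def surrogate_and_kl(theta, theta_old, states, actions, q_values, rho):
    pi, pi_old = softmax(theta), softmax(theta_old)
    ratio = pi[states, actions] / pi_old[states, actions]        # importance weights
    surr = np.sum(rho[states] * ratio * q_values) / len(states)  # eq. (26) objective
    kl = np.sum(rho[:, None] * pi_old * np.log(pi_old / pi), axis=1).sum()
    return surr, kl

def trust_region_step(theta_old, grad, states, actions, q_values, rho,
                      delta=0.01, step=1.0, shrink=0.5, max_tries=10):
    """Backtrack along an ascent direction until the KL constraint of eq. (25) holds."""
    surr_old, _ = surrogate_and_kl(theta_old, theta_old, states, actions, q_values, rho)
    for _ in range(max_tries):
        theta_new = theta_old + step * grad
        surr, kl = surrogate_and_kl(theta_new, theta_old, states, actions, q_values, rho)
        if kl <= delta and surr >= surr_old:
            return theta_new
        step *= shrink
    return theta_old                                             # reject the update

# Tiny synthetic problem: 3 states x 4 actions, sampled batch under the old policy.
rng = np.random.default_rng(1)
theta_old = rng.normal(size=(3, 4))
states = rng.integers(0, 3, size=64)
actions = rng.integers(0, 4, size=64)
q_values = rng.normal(size=64)                                   # stand-in Q estimates
rho = np.ones(3) / 3                                             # stand-in visitation weights
grad = rng.normal(size=(3, 4)) * 0.1                             # stand-in ascent direction
theta_new = trust_region_step(theta_old, grad, states, actions, q_values, rho)
print(np.abs(theta_new - theta_old).max())
```

The backtracking loop enforces the KL limit δ directly, which is the practical role of the constraint in equation (25): it caps how far the new policy may move away from the old one in a single update.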
step (9): once the iterations are judged complete, the power adjustment ΔP_{Gi} of each power generation area of the novel power system is regulated according to the trained parallel trust strategy optimization network, so that every area of the novel power system reaches the optimal tie-line power-exchange assessment index CPS; each power generation area can reach the optimal CPS by following steps (1) to (8); through the training of the networks in each power generation area, the areas cooperate to reach a dynamic balance: the frequency deviation Δf between power generation areas finally approaches 0, the power-exchange assessment index CPS approaches 100%, and the whole novel power system gradually reaches the global optimum.
Compared with the prior art, the invention has the following advantages and effects:
(1) In the distributed system all modules are mutually independent and the whole system is a multi-line parallel framework, so a problem in one module cannot affect normal operation of the whole, which gives high robustness; adding the real-time meteorological prediction network to the novel power system lets the system interact fully with the environment, so the novel power system can accurately track the environment and carry out intelligent power generation control and regulation.
(2) Adding the stacked self-coding neural network to the meteorological prediction neural network reduces the feature dimension of the weather data and accelerates network training.
(3) Compared with existing policy-optimization networks, the parallel trust strategy optimization network reduces the dimension of the expectation-value table and eliminates the curse of dimensionality.
Drawings
FIG. 1 is a block diagram of the meteorological prediction distribution parallel trust policy optimization of the method of the present invention.
FIG. 2 is a control flow diagram of the multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation of the method of the present invention.
Fig. 3 is a block diagram of a stacked self-coding network of the method of the present invention.
FIG. 4 is a block diagram of a gated loop unit of the method of the present invention.
Detailed Description
The invention provides a multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method, which is explained in detail by combining the accompanying drawings as follows:
FIG. 1 is a framework diagram of parallel trust strategy optimization of meteorological prediction distribution in the method of the present invention.
First, each controlled power generation area is defined as a controlled agent, and the areas are marked as {Agent_1, Agent_2, …, Agent_i};
Secondly, initializing parameters of a stack self-coding neural network and a gate control cycle unit in a meteorological prediction neural network, inputting meteorological data of the previous year, extracting meteorological features, and inputting the meteorological features into the stack self-coding neural network and the gate control cycle unit respectively;
then, training the stack self-coding neural network to obtain a parameter initial value for training the gating circulation unit, predicting future meteorological data by using the trained gating circulation unit, and finishing the training if the prediction effect reaches the standard;
then, inputting meteorological data to be predicted into a gating circulation unit through a stack self-coding neural network to obtain a prediction result, and inputting the prediction result into a novel power system;
then, setting three parallel trust strategy optimization networks of short-time distance, medium-time distance and long-time distance in each power generation area, wherein the short-time distance is one day, the medium-time distance is fifteen days, and the long-time distance is three months;
then, initializing system parallel trust strategy optimization network parameters, setting parallel trust strategy optimization network strategies, initializing parallel expectation value tables in the parallel trust strategy optimization network, setting the initial expectation value as 0, setting the search times as V, and setting the iteration times as X;
then, pre-training the parallel trust strategy optimization network, and inputting the initial values of the pre-trained network parameters into the parallel trust strategy optimization network;
then, under the current state, the parallel trust strategy optimization network in each intelligent agent depends on the strategy selection action to obtain the reward value corresponding to the action under the current environment, the obtained reward value is fed back to the parallel expected value table, meanwhile, the iteration number is increased by one, and whether the current iteration number is equal to X or not is judged; if the iteration times are not equal to X, updating the parallel expected value table, updating the time difference error in the experience pool, and updating the optimization strategy; if the iteration times are equal to X, the parallel trust strategy optimization network training is completed;
and finally, controlling the novel power system according to the trained parallel trust strategy optimization network, and regulating and controlling the power output of each agent to ensure that the novel power system reaches the optimal tie line power exchange assessment index CPS.
FIG. 2 is a control flow diagram of the multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation of the method of the present invention.
The flow is as follows: first, the meteorological prediction neural network part is run, the gated recurrent unit is trained with meteorological data from the previous period, and the trained gated recurrent unit then predicts future weather from the meteorological data of the current period;
then, taking Agent_1 of the novel power system as an example, three parallel trust strategy optimization networks with short, medium and long time scales are set up inside Agent_1; Agent_1 receives the meteorological data output by the prediction neural network and, based on Agent_1's frequency deviation Δf_1, area control error ACE_1 and tie-line power-exchange assessment index CPS, selects actions according to the policy in each parallel trust strategy optimization network;
finally, the parallel expectation-value table is updated, the temporal-difference error in the experience pool is updated, the optimization policy is updated, and the loop is repeated until Agent_1 reaches its optimal tie-line power-exchange assessment index CPS;
besides Agent_1, every other power generation area obtains the optimal tie-line power-exchange assessment index CPS of its own area by the same method.
Fig. 3 is a block diagram of a stacked self-coding network of the method of the present invention.
Firstly, initializing model parameters, and inputting a current meteorological data set into a self-coding neural network; establishing an initial self-coding neural network to compress meteorological data from original n dimensions to m dimensions;
then, the reconstruction output y of the self-coding neural network is ignored, the hidden layer h is taken as the new input and a new autoencoder is trained; stacking layer by layer in this way reduces the feature dimension of the data while keeping the key information;
then, comparing the trained data with the actual data, calculating a loss function, and updating system parameters;
and finally, inputting the trained initial parameter values into a self-coding neural network.
FIG. 4 is a block diagram of a gated loop unit of the method of the present invention.
Firstly, initializing model parameters, and inputting a current meteorological data set into a gating cycle unit;
then, the reset gate captures the short-term dependencies in the time series and the update gate captures the long-term dependencies, and the network parameters are updated accordingly;
and finally, using the trained network for predicting the meteorological data at the current stage.
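A possible usage pattern for the trained predictor (an assumption, not taken from the patent): encode the latest observations, warm up the recurrent state, then roll the unit forward and feed each prediction back in.

```python
# Illustrative rolling forecast with stand-in components (all names are assumptions).
import numpy as np

def rolling_forecast(encode, step_fn, readout, history, horizon):
    """history: list of raw feature vectors; horizon: number of future steps to predict."""
    h = np.zeros(8)                      # recurrent hidden state (size assumed)
    for x_raw in history:                # warm up on observed data
        h = step_fn(encode(x_raw), h)
    preds = []
    for _ in range(horizon):
        y = readout(h)                   # predicted (reduced) weather features
        preds.append(y)
        h = step_fn(y, h)                # feed the prediction back in
    return preds

# Tiny synthetic usage.
encode = lambda x: x[:8]                             # pretend dimension reduction
step = lambda x, h: np.tanh(0.5 * h + 0.5 * x)       # stand-in recurrent update
readout = lambda h: h                                # identity output head
rng = np.random.default_rng(0)
history = [rng.normal(size=16) for _ in range(5)]
print(len(rolling_forecast(encode, step, readout, history, horizon=3)))  # 3
```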

Claims (1)

1. A multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method, characterized in that the method combines multi-time-scale meteorological prediction, distributed parallelism and trust-strategy-optimization neural networks for the power generation control of a novel power system; in use, the method comprises the following steps:
step (1): define each controlled power generation area as an agent and mark the areas as {Agent_1, Agent_2, …, Agent_i}, where i is the index of the respective power generation area; the power generation areas do not interfere with one another yet remain interconnected, which gives the system high robustness;
step (2): initialize the parameters of the stacked self-coding neural network and the gated recurrent unit, collect three years of meteorological wind-intensity and illumination-intensity data, extract the meteorological feature data set and input it into the stacked self-coding neural network;
the stacked self-coding neural network is formed by stacking several self-coding (autoencoder) networks; x is the input meteorological feature vector, an n-dimensional vector with x ∈ R^n; the hidden layer h^(1) of the self-coding network AE_1 is taken as the input of the self-coding network AE_2; after AE_2 is trained, its hidden layer h^(2) is taken as the input of AE_3, and so on; stacking layer by layer reduces the feature dimension of the meteorological data, accelerates training of the gated recurrent unit and preserves the key information in the data; the hidden layers have different dimensions, so that:

$$
\begin{cases}
h^{(1)} = f\left(W^{(1)} x + b^{(1)}\right) \\
h^{(2)} = f\left(W^{(2)} h^{(1)} + b^{(2)}\right) \\
\quad\vdots \\
h^{(p)} = f\left(W^{(p)} h^{(p-1)} + b^{(p)}\right) \\
\hat{x} = \mathrm{softmax}\left(h^{(p)}\right)
\end{cases}
\tag{1}
$$

where h^(1), h^(2), h^(p-1) and h^(p) are the hidden layers of the self-coding networks AE_1, AE_2, AE_{p-1} and AE_p; W^(1), W^(2) and W^(p) are the parameter matrices of the hidden layers h^(1), h^(2) and h^(p); b^(1), b^(2) and b^(p) are the biases of AE_1, AE_2 and AE_p; f() is the activation function; p is the number of stacked self-coding layers; softmax() is the normalized exponential function used as the classifier;
in the stacked self-coding neural network, if h^(1) is m-dimensional and h^(2) is k-dimensional, stacking AE_1 onto AE_2 trains a network with the structure n → m → k: the network n → m → n is trained to obtain the transformation n → m, the network m → k → m is then trained to obtain the transformation m → k, and finally AE_1 and AE_2 are stacked to obtain the network n → m → k; by stacking AE_1 through AE_p layer by layer, the output vector x̂ is finally obtained through the softmax function;
after the stacked self-coding neural network is trained, initial network parameter values and the dimension-reduced meteorological features x̂ are obtained as the output;
step (3): let x̂_t denote the output vector of the self-coding neural network at time t; pre-training the stacked self-coding neural network yields the initial network parameter values and the meteorological features x̂_t, which are set as the inputs of the update gate and the reset gate in the gated recurrent unit; the outputs of the update gate and the reset gate are, respectively:

$$
\begin{cases}
z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)
\end{cases}
\tag{2}
$$

where z_t is the output of the update gate at time t, r_t is the output of the reset gate at time t, h_{t-1} is the hidden state of the gated recurrent unit at time (t-1), x_t is the input at time t, [·,·] denotes the concatenation of two vectors, W_z is the weight matrix of the update gate, W_r is the weight matrix of the reset gate, and σ() is the sigmoid function;
the gated recurrent unit discards and memorizes the input information through the two gates; the candidate hidden state h̃_t at time t is:

$$
\tilde{h}_t = \tanh\left(W_{\tilde{h}} \cdot [\,r_t * h_{t-1},\; x_t\,]\right)
\tag{3}
$$

where tanh() is the tanh activation function, W_{h̃} is the weight matrix of the candidate hidden state h̃_t, and * denotes the element-wise product;
after the update gate has produced the updated state information, the tanh activation creates a vector of candidate values from the input, which gives the candidate hidden state h̃_t; the network then computes the state h_t at time t as:

$$
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\tag{4}
$$

the reset gate determines how much of the past state should be remembered: when r_t is 0, the state information h_{t-1} at time (t-1) is forgotten and the candidate hidden state h̃_t is reset to the information input at time t; the update gate determines how much of the past state enters the new state: when z_t is 1, the candidate hidden state h̃_t becomes the state h_t at time t; by storing and filtering information with the update and reset gates, the gated recurrent unit retains the important features through the gate functions and learns to capture dependencies, thereby obtaining the optimal meteorological prediction;
and (4): after the training of the stack self-coding neural network and the gating circulation unit is finished, meteorological data to be predicted are input into the gating circulation unit through the stack self-coding neural network, and the obtained meteorological predicted value is input into the novel power system; in a novel power system, three parallel trust strategy optimization networks of short-time distance, medium-time distance and long-time distance are arranged in each power generation area, wherein the short-time distance is one day, the medium-time distance is fifteen days, and the long-time distance is three months;
and (5): initializing parameters of a parallel trust strategy optimization network in each region, setting a parallel trust strategy optimization network strategy, and initializing a parallel expectation value table in the parallel trust strategy optimization network, wherein the initial expectation value is 0;
and (6): setting the iteration frequency as X, setting an initial value of the search frequency as a positive integer V, and initializing the search frequency of each intrinsic action as V = V;
and (7): in the current state, the parallel trust strategy optimization network in each intelligent agent selects an action by means of a strategy to obtain an award value corresponding to the action in the current environment, feeds the obtained award value back to a parallel expectation value table, and then adds one to the iteration number; if the current iteration times are equal to X, the iteration is completed, and a trained parallel trust strategy optimization network is obtained;
and (8): carrying out strategy optimization and parallel value optimization in a parallel trust strategy optimization network, wherein the optimization method comprises the following steps:
the core of the parallel trust optimization strategy network is an actor-critic method; in policy optimization of a parallel belief optimization policy network, the Markov decision process is the tuple (S, A, P, r, ρ) 0 Gamma), wherein S is a state space consisting of wind power intensity, illumination intensity, frequency deviation delta f, area control error ACE and tie line power exchange assessment index CPS, and any state S belongs to S; a is the power variation Δ P of different magnitude Gi ,i=1,2,...,2 j A formed motion space, where j is a quantum stacking state motion | A>Any action a belongs to A; p is a transition probability distribution matrix for transitioning from an arbitrary state s to a state s' through an arbitrary action a; r () is a reward function; rho 0 Is in an initial state s 0 A probability distribution of (a); gamma is a discount factor; let π denote the random strategy π: S × A → [0,1](ii) a The desired jackpot function η (π) under policy π is:
Figure FDA0003794337300000024
wherein s is 0 Is in an initial state, a 0 Is in a state s 0 Action of pi selection, gamma, of lower random strategy t Discounting factor for time t, s t Is the state at time t, a t R(s) is the movement at time t t ) Is in a state s t The value of the prize to be awarded,
Figure FDA0003794337300000025
representing a state s in strategy pi t Downward action a t Sampling is carried out;
introducing a state-action value function Q π (s t ,a t ) Function of state value V π (s t ) Advantage function A π (s, a) and probability distribution function ρ π (s);
Function of state-action valueNumber Q π (s t ,a t ) Finding the state s in strategy π t Lower execution action a t Late jackpot, state-action value function Q π (s t ,a t ) Comprises the following steps:
Figure FDA0003794337300000026
wherein s is t+1 The state at time (t + 1); s t+l The state at time (t + l); a is t+1 An action at time (t + 1); a is t+l The action at time (t + l);
Figure FDA0003794337300000027
representing a state s in strategy pi t+l Downward action a t+l Sampling is carried out; gamma ray t+l A discount factor for time (t + l); l is a positive integer; r(s) t+l ) Is in a state s t+l A reward value of;
function of state value V π (s t ) Finding the state s in strategy π t Accumulated award of, V π (s t ) Is Q π (s t ,a t ) Regarding action a t Average value of (2), state value function V π (s t ) Comprises the following steps:
Figure FDA0003794337300000028
wherein
Figure FDA0003794337300000029
Expressing Q in the strategy pi π (s t ,a t ) Regarding action a t Average value of (a);
dominance function A π (s, a) to find the dominance of taking an arbitrary action a in an arbitrary state s compared to the average, a dominance function A π (s, a) is:
A π (s,a)=Q π (s,a)-V π (s) (8)
Q π (s, a) is the state s t Is in an arbitrary state s, action a t A state-action value function for any action a; v π (s) is a state s t Is a function of the state value at any state s;
probability distribution function ρ π (s) solving the probability distribution, probability distribution function rho, of the probability distribution in any state s in the strategy pi π (s) is:
Figure FDA0003794337300000031
wherein P(s) t = s) is the state s at time t t A transition probability distribution matrix when in an arbitrary state s;
the parallel value optimization of the parallel trust optimization strategy network is to a state-action value function Q π (s t ,a t ) Performing parallel optimization; function Q of state-action value π (s t ,a t ) Is optimized towards Q π (s t ,a t ) Introducing quantum bits and a Grover search method to accelerate the network training speed and eliminate the dimensionality disaster;
parallel value optimization will act a t Quantising and using in state s t Updating the searching times V instead of the state s t The action probability is updated as follows:
an action space is provided with 2 j An intrinsic action of 2 j An intrinsic action | a t >Is represented by the superposition of
Figure FDA0003794337300000032
Will act a t Quantization to j-dimensional quantum superposition state action | A>,|A>Is each qubit of |0>And |1>Superposition of the two states; quantum superposition state motion
Figure FDA0003794337300000033
Equivalent to | A>(ii) a j dimension quantum superposition dynamicMake | A>The expression of (c) is:
Figure FDA0003794337300000034
wherein | a>Is a j-dimensional quantum superposition state action | A>Observed quantum motion; c a Is a quantum motion | a>A probability amplitude; i C a | Quantum action | a>A modulus of (d) satisfies
Figure FDA0003794337300000035
When quantum superposition state motion | A > is observed, | A > will collapse to a quantum motion | a >, which is a |0> or |1> on each qubit of quantum motion | a >; each qubit of quantum action | a > has a value representing its desired value; the expected values of these qubits are different in different states; these expected values are used to select an action in a certain state, and the update rule of these expected values is as follows:
if in strategy π, in state s t Down, quantum superposition state action | A>Collapse into a quantum motion | a>(ii) a Action | a as the jackpot η increases>The qubit in each qubit is |0>Is reduced, the qubit is |1>The expected value of (d) increases; updating the increment and decrement of the expected value according to strategy pi, and setting the quantum position with positive expected value of each quantum position of the action as |1 when the action is selected in the same state>And the rest quantum positions are |0>Obtaining quantum motion; let the parameter vector of strategy pi be
Figure FDA0003794337300000036
Normalizing the quantum motion to a specific motion value and passing through a parameter vector
Figure FDA0003794337300000037
Converted into power variation quantity delta P G1 As Agent in this state 1 Outputting; wherein theta is 01 ,...,θ u As a vector of parameters
Figure FDA0003794337300000038
U is a positive integer;
Quantum action selection uses the Grover search method to obtain, for the state s_t at time t, the quantum action |a> produced by observing and collapsing the quantum superposition-state action |A>. The search count V is obtained and updated as follows:

First, all intrinsic actions are superposed with equal weight by applying j Hadamard gates in sequence to j independent qubits initialized to |0>, giving:

|ψ_0> = H^{⊗j} |0>^{⊗j} = (1/√(2^j)) Σ_a |a>

where H is the Hadamard gate, which converts the ground state |0> into the equal-weight superposition (|0> + |1>)/√2; |ψ_0> is the initialized quantum superposition-state action, formed by superposing the 2^j intrinsic actions with equal probability amplitude; H^{⊗j} denotes applying j Hadamard gates in sequence to the 2^j initialized intrinsic actions. The two parts of the Grover iteration operator, U_|a> and U_|ψ_0>, are respectively:

U_|a> = I − 2|a><a|
U_|ψ_0> = H^{⊗j}(I − 2|0><0|)H^{⊗j} = I − 2|ψ_0><ψ_0|

where I is the identity matrix of appropriate dimension; <a| is the conjugate (bra) of the quantum action |a>; <0| is the conjugate of |0>; <ψ_0| is the conjugate of |ψ_0>; |a><a| is the outer product of the quantum action |a>; |ψ_0><ψ_0| is the outer product of the initialized quantum superposition-state action |ψ_0>. U_|a> and U_|ψ_0> are quantum black boxes: when U_|a> acts on an intrinsic action |a_t>, it changes by 180 degrees the phase of the component aligned with |a>; when U_|ψ_0> acts on an intrinsic action |a_t>, it changes by 180 degrees the phase of the component aligned with |ψ_0>.

The Grover iteration is recorded as U_Grov:

U_Grov = U_|ψ_0> U_|a>

Each intrinsic action is iterated according to U_Grov; the iteration count that comes closest to the target action with the fewest iterations is taken, the search count V corresponding to each intrinsic action is updated to that count, and the strategy π is updated at the same time.
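The Grover iteration described above can be prototyped directly with state vectors. The sketch below is a generic amplitude-amplification routine, not taken from the patent text: it builds the equal-weight superposition, applies U_|a> and U_|ψ_0> as matrices, and reports how many iterations maximise the probability of the marked action; the loop cap is an assumption.

```python
import numpy as np

def grover_search(j, marked):
    """Return (best iteration count V, success probability) for one marked action."""
    dim = 2 ** j
    # |psi0> = H^(tensor j) |0...0>  -> equal-weight superposition
    psi0 = np.full(dim, 1.0 / np.sqrt(dim))
    a = np.zeros(dim); a[marked] = 1.0

    U_a    = np.eye(dim) - 2.0 * np.outer(a, a)        # flips the phase along |a>
    U_psi0 = np.eye(dim) - 2.0 * np.outer(psi0, psi0)  # flips the phase along |psi0>
    U_grov = U_psi0 @ U_a                              # one Grover iteration (up to global phase)

    state, best_v, best_p = psi0.copy(), 0, abs(psi0[marked]) ** 2
    for v in range(1, int(np.ceil(np.pi / 4 * np.sqrt(dim))) + 1):  # assumed iteration cap
        state = U_grov @ state
        p = abs(state[marked]) ** 2
        if p > best_p:
            best_v, best_p = v, p
    return best_v, best_p

print(grover_search(j=4, marked=5))   # roughly (3, 0.96) for 16 intrinsic actions
```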
From equation (9), when the strategy is updated from π to the updated strategy π̃, the expected cumulative reward function η(π̃) of the state s_t at time t in any state s is:

η(π̃) = η(π) + Σ_t Σ_s P(s_t = s | π̃) Σ_a π̃(a|s) γ^t A_π(s, a) = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s, a)   (14)

where π̃ is the updated strategy; P(s_t = s) is the probability distribution matrix, under strategy π̃, of the state s_t at time t over any state s; A_π(s, a) is the advantage function of any state s and any action a under strategy π; ρ_π̃ is the discounted state visitation distribution, under strategy π̃, of the state s_t at time t over any state s; a ~ π̃(·|s_t) denotes sampling the action a_t from strategy π̃ in state s_t; π̃(a|s) is the sampled probability of action a in any state s under strategy π̃; ρ_π̃(s) is the probability distribution of any state s under strategy π̃; A_π(s_t, a_t) is the advantage of taking action a_t in state s_t over the average; η(π) is the expected cumulative reward function under strategy π.
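As a concrete illustration of equation (14), the following self-contained sketch builds a small random MDP, evaluates V, Q and the advantage A_π exactly by solving the Bellman linear system, and checks that η(π̃) equals η(π) plus the discounted, π̃-weighted advantage term. The sizes, discount factor, start distribution and random policies are assumptions made only for this check, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
R = rng.normal(size=(nS, nA))                   # reward for taking a in s
mu = np.full(nS, 1.0 / nS)                      # start-state distribution

def random_policy():
    return rng.dirichlet(np.ones(nA), size=nS)  # pi[s, a]

def evaluate(pi):
    P_pi = np.einsum('sap,sa->sp', P, pi)       # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', R, pi)         # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                       # Q[s, a]
    return V, Q, P_pi

def eta(pi):                                    # expected cumulative reward from mu
    V_p, _, _ = evaluate(pi)
    return mu @ V_p

pi, pi_new = random_policy(), random_policy()
V, Q, _ = evaluate(pi)
A = Q - V[:, None]                              # advantage A_pi(s, a)

# discounted visitation distribution rho under the updated policy pi_new
_, _, P_new = evaluate(pi_new)
rho = np.linalg.solve(np.eye(nS) - gamma * P_new.T, mu)

lhs = eta(pi_new)
rhs = eta(pi) + rho @ np.einsum('sa,sa->s', pi_new, A)
print(lhs, rhs)   # the two numbers agree up to rounding (identity (14))
```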
In an arbitrary state s, there is

Σ_a π̃(a|s) A_π(s, a) ≥ 0

where π̃(a|s) is the probability with which the updated strategy π̃ selects action a when the state s_t at time t is any state s. The action values selected by the updated strategy can therefore improve the cumulative reward η, or keep the cumulative reward η unchanged when the expected advantage is zero, so that continually updating the strategy optimizes the cumulative reward η;
Since calculating equation (14) requires the probability distribution ρ_π̃ of the updated strategy π̃, equation (7) becomes more complex and difficult to optimize, so a substitution function L_π(π̃) is introduced to reduce the computational complexity:

L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)   (15)

where argmax() is a function that finds the argument maximizing a function; ρ_π(s) is the probability distribution function of any state s under strategy π;
The difference between the substitution function L_π(π̃) and η(π̃) is that the substitution function L_π(π̃) ignores the change in state visitation density caused by the policy change: it uses ρ_π as the visitation frequency rather than ρ_π̃. The visitation frequency ρ_π is obtained by using an approximation to the strategy π; when the strategy π and the updated strategy π̃ satisfy certain constraints, the substitution function L_π(π̃) can replace the original expected cumulative reward function η(π̃).
In the update of the parameter vector θ, the strategy π is parameterized by the parameter vector θ, in the form of an arbitrary parameter θ, as π_θ(a|s), i.e. the probability of any action a in any state s under the parameterized strategy π_θ. For any parameter θ, when the policy has not yet been updated, the substitution function L_{π_θ}(π_θ) and the original cumulative reward function η(π_θ) are exactly equal, i.e.:

L_{π_θ}(π_θ) = η(π_θ)   (16)

where π_θ is the strategy π parameterized by the parameter vector θ;
Moreover, the derivatives of the substitution function and of the original cumulative reward function with respect to any parameter θ are identical at the strategy π_θ; that is, if the policy is updated from π_{θ_old} to π_{θ̃} by a sufficiently small change and the substitution function value L_{π_{θ_old}}(π_{θ̃}) increases, the cumulative reward η also increases, so the strategy can be improved by using the substitution function as the optimization objective, namely:

∇_θ L_{π_{θ_old}}(π_θ) |_{θ=θ_old} = ∇_θ η(π_θ) |_{θ=θ_old}   (17)

where ∇_θ denotes the derivative of a function with respect to the arbitrary parameter θ;
Equations (16) and (17) show that updating the strategy from π_{θ_old} to π_{θ̃} by a sufficiently small step increases the cumulative reward η. Define π′ as the strategy with the largest cumulative reward value among the old strategies, and define the intermediate divergence variable α. To raise the lower bound of the cumulative reward η, set the conservative iteration strategy π_new(a|s) as:

π_new(a|s) = (1 − α)π_old(a|s) + απ′(a|s)   (18)

where π_new is the new strategy; π_old is the current strategy; α is the maximum total variation divergence between π_new and π_old; π_old(·|s) is the action distribution selected in any state s under strategy π_old; π_new(·|s) is the action distribution selected in any state s under strategy π_new; D_TV(π_old(·|s) || π_new(·|s)) is the total variation divergence between π_old(·|s) and π_new(·|s); π′(a|s) is any action a selected in any state s under strategy π′; π_old(a|s) is any action a selected in any state s under strategy π_old;
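A minimal sketch of the conservative update in equation (18): mix the current policy with the best old policy and measure the resulting maximum total-variation divergence. The tabular policies and the mixing coefficient are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, alpha = 5, 4, 0.2          # assumed sizes and mixing coefficient

pi_old   = rng.dirichlet(np.ones(nA), size=nS)   # current policy pi_old(a|s)
pi_prime = rng.dirichlet(np.ones(nA), size=nS)   # best old policy pi'(a|s)

# conservative iteration, equation (18)
pi_new = (1.0 - alpha) * pi_old + alpha * pi_prime

# maximum total-variation divergence between pi_old and pi_new over all states
d_tv_max = 0.5 * np.abs(pi_new - pi_old).sum(axis=1).max()
print(pi_new.sum(axis=1))     # each row still sums to 1
print(d_tv_max, "<=", alpha)  # the mixture moves at most alpha in total variation
```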
For any random strategy, let the intermediate entropy variable ε = max_{s,a} |A_π(s, a)|, where max_{s,a}|·| is the maximum absolute value over any state s and any action a. Substituting π̃ for π_new and π for π_old, the substitution function value L_π and the cumulative reward η satisfy:

η(π̃) ≥ L_π(π̃) − (4εγ / (1 − γ)^2) · α^2   (19)

where γ is the discount factor;
The maximum relative entropy is

D_KL^max(π, π̃) = max_s D_KL(π(·|s) || π̃(·|s))

where π(·|s) is the action distribution selected in an arbitrary state s under strategy π; π̃(·|s) is the action distribution selected in an arbitrary state s under the updated strategy π̃; D_KL(π(·|s) || π̃(·|s)) is the relative entropy between π(·|s) and π̃(·|s).

The relation between the total variation divergence and the relative entropy satisfies

D_TV(π(·|s) || π̃(·|s))^2 ≤ D_KL(π(·|s) || π̃(·|s))

where D_TV(π(·|s) || π̃(·|s)) is the total variation divergence between π(·|s) and π̃(·|s);
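A quick numerical check of the inequality D_TV(p||q)^2 ≤ D_KL(p||q) used above, with assumed random distributions and natural-log relative entropy:

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(5):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    d_tv = 0.5 * np.abs(p - q).sum()      # total variation divergence
    d_kl = np.sum(p * np.log(p / q))      # relative entropy (KL divergence)
    print(f"D_TV^2 = {d_tv**2:.4f} <= D_KL = {d_kl:.4f}", d_tv**2 <= d_kl)
```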
Let C = 4εγ/(1 − γ)^2 and re-constrain with the relative entropy:

η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃)   (20)

where C is the penalty coefficient.

Under this constraint, the continuously updated strategies π_0 → π_1 → ... → π_X satisfy η(π_0) ≤ η(π_1) ≤ ... ≤ η(π_X), where → denotes the policy update process; π_0, π_1, ..., π_X is the strategy sequence of the parallel trust optimization strategy network; η(π_0), η(π_1), ..., η(π_X) are the cumulative rewards of the strategies in the strategy sequence of the parallel trust optimization strategy network;
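The improvement chain above follows from a standard minorize-maximize argument; a brief sketch, where M_i is a helper surrogate introduced only for this derivation:

```latex
% Minorizing surrogate at iteration i (helper notation for this sketch only):
M_i(\pi) = L_{\pi_i}(\pi) - C\, D_{KL}^{\max}(\pi_i, \pi)
% By (20):  \eta(\pi) \ge M_i(\pi) for every \pi, and \eta(\pi_i) = M_i(\pi_i)
% because D_{KL}^{\max}(\pi_i, \pi_i) = 0.  If \pi_{i+1} maximizes M_i, then
\eta(\pi_{i+1}) \ge M_i(\pi_{i+1}) \ge M_i(\pi_i) = \eta(\pi_i)
% hence \eta(\pi_0) \le \eta(\pi_1) \le \dots \le \eta(\pi_X).
```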
Considering the parameterized policy π_θ and the parameter vector θ, the items unrelated to the parameter vector θ are pruned.

The expected cumulative reward function after conversion of the parameter variables, η(θ), is:

η(θ) := η(π_θ)   (21)

The substitution function after conversion of the parameter variables, L_{θ_old}(θ), is:

L_{θ_old}(θ) := L_{π_{θ_old}}(π_θ)   (22)

The relative entropy after conversion of the parameter variables, D_KL^max(θ_old, θ), is:

D_KL^max(θ_old, θ) := D_KL^max(π_{θ_old}, π_θ)   (23)

The constraint condition after conversion of the parameter variables is:

argmax_θ [ L_{θ_old}(θ) − C · D_KL^max(θ_old, θ) ]   (24)

where := denotes taking the equivalent value after variable conversion; θ_old is the parameter vector that needs to be updated; θ is the updated value of the parameter vector θ_old; π_θ is the strategy π parameterized by the parameter vector θ; π_{θ_old} is the strategy π parameterized by the parameter vector θ_old; η(θ) is the expected cumulative reward function of the strategy π_θ; L_{θ_old}(θ) is the substitution function of the strategy π_{θ_old} evaluated at π_θ; D_KL^max(θ_old, θ) is the relative entropy between π_{θ_old} and π_θ; D_KL^max is the maximum value of the relative entropy after the parameter variable conversion;
The update process of the parallel strategy optimization network parameter vector θ is obtained from formulas (21) to (24); updating the parameter vector θ optimizes the selection weights of the actions, thereby achieving the purpose of optimizing the parallel control. To ensure that the cumulative reward η increases, L_{θ_old}(θ) − C · D_KL^max(θ_old, θ) is maximized; however, with C used as a penalty coefficient, D_KL^max(θ_old, θ) becomes small at each update, which results in a short step per update and reduces the update speed, so the penalty term is turned into a constraint term:

maximize_θ L_{θ_old}(θ)   subject to   D_KL^max(θ_old, θ) ≤ δ   (25)

where δ is a constant;
Equation (14) samples according to the updated policy π̃; since π̃ is unknown before the update, sampling cannot be based on π̃, so importance sampling is used to rewrite the parameterized cumulative reward function η(θ). For the parameterized cumulative reward function η(θ), the terms unrelated to the parameter θ are ignored and the state-action value function Q_{θ_old} is substituted for the advantage A_{θ_old}; finally, the update of the parallel trust optimization strategy network becomes:

maximize_θ E_{s~ρ_{θ_old}, a~π_{θ_old}} [ (π_θ(a|s) / π_{θ_old}(a|s)) · Q_{θ_old}(s, a) ]   subject to   D_KL^max(θ_old, θ) ≤ δ   (26)

where E_{s~ρ_{θ_old}, a~π_{θ_old}}[·] denotes sampling states from the parameterized probability distribution ρ_{θ_old} of the policy π_{θ_old}, sampling actions from π_{θ_old}, and using the state-action value Q_{θ_old}; Q_{θ_old}(s, a) is the state-action value function, under the policy π_{θ_old}, of the state s_t being any state s and the action a_t being any action a.

The parameter vector θ is updated according to the set constraint; the strategy π is updated with the updated parameter vector to complete the strategy update in the parallel strategy optimization network, and a new strategy is then used to select actions in the current state, iterating step by step;
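A compact, self-contained sketch of the constrained update in (25) and (26) for a tabular softmax policy: it estimates the importance-weighted surrogate from trajectories sampled under π_{θ_old}, takes a gradient step, and backtracks until the KL divergence stays below δ. The environment, sizes, δ and step size are assumptions for illustration rather than the patent's settings; the average KL over sampled states is used as a tractable stand-in for D_KL^max, and a full implementation would add a critic and a natural-gradient step.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, delta, H = 4, 3, 0.95, 0.01, 30   # assumed sizes, KL radius, horizon

P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # hypothetical MDP dynamics
R = rng.normal(size=(nS, nA))

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def rollout(pi, episodes=200):
    """Sample (s, a, discounted return-to-go) tuples under pi as a crude Q estimate."""
    data = []
    for _ in range(episodes):
        s, traj = rng.integers(nS), []
        for _ in range(H):
            a = rng.choice(nA, p=pi[s])
            traj.append((s, a, R[s, a]))
            s = rng.choice(nS, p=P[s, a])
        g = 0.0
        for s, a, r in reversed(traj):
            g = r + gamma * g
            data.append((s, a, g))
    return data

def surrogate_and_grad(theta, theta_old, data):
    pi, pi_old = policy(theta), policy(theta_old)
    L, grad = 0.0, np.zeros_like(theta)
    for s, a, q in data:
        ratio = pi[s, a] / pi_old[s, a]          # importance weight pi/pi_old
        L += ratio * q
        dlogpi = -pi[s].copy(); dlogpi[a] += 1.0 # d log pi(a|s) / d theta[s, :]
        grad[s] += ratio * q * dlogpi
    return L / len(data), grad / len(data)

def mean_kl(theta_old, theta, data):
    pi_old, pi = policy(theta_old), policy(theta)
    return np.mean([np.sum(pi_old[s] * np.log(pi_old[s] / pi[s])) for s, _, _ in data])

theta_old = np.zeros((nS, nA))
data = rollout(policy(theta_old))
L_old, g = surrogate_and_grad(theta_old, theta_old, data)

step = 1.0
while True:                                      # backtracking line search on the KL constraint
    theta = theta_old + step * g
    L_new, _ = surrogate_and_grad(theta, theta_old, data)
    if mean_kl(theta_old, theta, data) <= delta and L_new >= L_old:
        break
    step *= 0.5
print("surrogate improved from", round(L_old, 3), "to", round(L_new, 3))
```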
Step (9): after the iteration is judged to be complete, the trained parallel trust strategy optimization network regulates the power variation ΔP_Gi of each power generation area of the novel power system, so that each area of the novel power system reaches the optimal tie-line power exchange assessment index CPS. Each power generation area can reach the optimal tie-line power exchange assessment index CPS by the method of steps (1) to (8). Through the training of the networks in each power generation area, regional cooperation is sought to achieve dynamic balance; finally, the frequency deviation Δf between the power generation areas approaches 0, the power exchange assessment index CPS approaches 100%, and the whole novel power system gradually achieves global optimization.
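To show how trained per-area agents could be wired together in step (9), here is a hypothetical coordination skeleton. The closed-loop dynamics, the CPS proxy and all numeric settings are invented placeholders; a real system would plug in the trained parallel trust strategy optimization networks and the actual tie-line and frequency models.

```python
import numpy as np

rng = np.random.default_rng(4)
N_AREAS, STEPS = 3, 50               # hypothetical area count and control horizon

def agent_policy(area: int, delta_f: float) -> float:
    """Placeholder for a trained per-area network: proportional response plus noise."""
    return -20.0 * delta_f + rng.normal(scale=0.5)

delta_f = np.full(N_AREAS, 0.2)      # assumed initial frequency deviations (Hz)
for t in range(STEPS):
    delta_P = np.array([agent_policy(i, delta_f[i]) for i in range(N_AREAS)])
    # toy closed-loop dynamics: generation changes pull the deviation toward zero
    delta_f = 0.9 * delta_f + 0.002 * delta_P

cps_proxy = 100.0 * (1.0 - np.abs(delta_f).mean() / 0.2)   # crude stand-in for the CPS index
print("final |delta_f|:", np.round(np.abs(delta_f), 4), "CPS proxy (%):", round(cps_proxy, 2))
```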
CN202210967237.1A 2022-08-12 2022-08-12 Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method Pending CN115238592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210967237.1A CN115238592A (en) 2022-08-12 2022-08-12 Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210967237.1A CN115238592A (en) 2022-08-12 2022-08-12 Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method

Publications (1)

Publication Number Publication Date
CN115238592A true CN115238592A (en) 2022-10-25

Family

ID=83678600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210967237.1A Pending CN115238592A (en) 2022-08-12 2022-08-12 Multi-time-interval meteorological prediction distribution parallel trust strategy optimized power generation control method

Country Status (1)

Country Link
CN (1) CN115238592A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117040030A (en) * 2023-10-10 2023-11-10 国网浙江宁波市鄞州区供电有限公司 New energy consumption capacity risk management and control method and system
CN117040030B (en) * 2023-10-10 2024-04-02 国网浙江宁波市鄞州区供电有限公司 New energy consumption capacity risk management and control method and system
CN118297364A (en) * 2024-06-06 2024-07-05 贵州乌江水电开发有限责任公司 Production scheduling system and method for watershed centralized control hydropower station

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination