CN108008627A - Parallel optimization reinforcement learning self-adaptive PID control method - Google Patents
Parallel optimization reinforcement learning self-adaptive PID control method
- Publication number
- CN108008627A (application number CN201711325553.4A)
- Authority
- CN
- China
- Prior art keywords
- pid
- parameter
- output
- control
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B11/00—Automatic controllers
- G05B11/01—Automatic controllers electric
- G05B11/36—Automatic controllers electric with provision for obtaining particular characteristics, e.g. proportional, integral, differential
- G05B11/42—Automatic controllers electric with provision for obtaining particular characteristics, e.g. proportional, integral, differential for obtaining a characteristic which is both proportional and time-dependent, e.g. P. I., P. I. D.
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/0265—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
- G05B13/027—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses a parallel-optimization reinforcement learning adaptive PID control method, characterised by comprising the following steps. Step S1: using MATLAB software, discretize the transfer function by the zero-order-hold method and initialize the controller parameters and M control threads for parallel learning. Step S2: define an input signal and pass it to the transfer function of S1, compute the output value, and take the difference between the input and output signals as the input vector of the control algorithm. Step S3: pass the input vector to the improved adaptive PID controller for training; a trained model is obtained after N iterations. Step S4: run control tests with the trained model and record the input signal, the output signal and the changes of the PID parameters. Step S5: visualize the test data and compare the control performance. The invention better resolves the problems of conventional adaptive PID control and, by exploiting the multi-threaded parallel learning of A3C, improves the stability and learning efficiency of the algorithm.
Description
Technical field
The present invention relates to an adaptive PID control method and belongs to the technical field of control; specifically, it is an
improved adaptive PID (proportional-integral-derivative) control algorithm based on parallel-optimization actor-critic learning.
Background technology
PID (proportional/integral/derivative) control is a linear control scheme that acts on the deviation between setpoint and output. Because its principle is simple, it is robust, easy to tune and does not require an accurate mathematical model of the plant, it has become the most widely used control scheme in industry. In the engineering practice of tuning PID parameters, especially for linear, time-invariant, weakly time-delayed systems, traditional tuning methods have accumulated rich experience and are widely applied. In real industrial process control, however, many plants exhibit time-varying, nonlinear and pure-delay behaviour, and the control mechanisms are complex; under the influence of factors such as noise and load disturbances, the process parameters and even the model structure can change. The PID parameters must therefore be tuned on-line to meet real-time control requirements. In such cases traditional tuning methods can hardly satisfy the demands of engineering practice and show significant limitations.
Adaptive PID control technology is an effective way to solve these problems. An adaptive PID control model combines the idea of adaptive control with the advantages of a conventional PID controller. First, as an adaptive controller, it can automatically identify the controlled process, automatically tune the controller parameters and adapt to variations of the process parameters; second, it retains the simple structure, good robustness and high reliability of a conventional PID controller. Because of these advantages it has become a preferred industrial process controller in engineering practice. Since adaptive PID control was proposed it has received extensive research, and fuzzy adaptive PID controllers, neural-network adaptive PID controllers and Actor-Critic adaptive PID controllers have been proposed in succession.
For example, Document 1 (Liu Guorong et al. Fuzzy self-adaptive PID control [J]. Control and Decision, 1995(6)) proposes an adaptive PID controller based on fuzzy rules. Its main idea is: when the setpoint changes abruptly or a state or structure disturbance occurs, the transient response can be divided into 9 cases; after the system response is obtained at each sampling instant, the deviation from the setpoint and its trend at that moment can be determined, and, using existing control knowledge and a fuzzy control method, the control action is increased or decreased appropriately so that the response is driven back towards the setpoint as quickly as possible. However, this control method requires expert experience and parameter optimization to control complex systems; if the fuzzy rules are set inaccurately, the control performance is unsatisfactory.
Document 2 (Liao Fangfang, Xiao Jian. Research on PID parameter self-tuning based on BP neural network [J]. Journal of System Simulation, 2005) proposes adaptive PID control based on a BP neural network. Its control idea is: the neural-network identifier propagates the control deviation back to its own neurons to correct its own weights; the setpoint and the actual plant output pass through the identifier and are back-propagated to the neural-network controller, whose weights are corrected with the error signal. After repeated learning the controller can gradually follow the changes of the system. This method usually optimizes the parameters by supervised learning, but the teacher signal is difficult to obtain.
Document 3 (Chen Xuesong, Yang Yimin. Adaptive PID control based on actor-critic learning [J]. Control Theory & Applications, 2011) proposes an adaptive PID control with an Actor-Critic structure. The control idea is: using the model-free on-line learning ability of AC learning, the PID parameters are adjusted adaptively, and a single RBF network simultaneously realizes the policy function of the Actor and the value-function learning of the Critic. This overcomes the difficulty of tuning the parameters of a conventional PID controller on-line in real time and gives fast response and strong adaptive ability. However, the inherent instability of the AC learning structure often makes the algorithm hard to converge.
Patent CN201510492758 discloses an adaptive PID control method for an actuator. The method combines an expert PID controller and a fuzzy controller, each connected to the actuator; the actuator selects the expert PID controller or the fuzzy controller according to the current state information and the desired information. Although this controller can reduce overshoot and achieves high control accuracy, it still requires a large amount of prior knowledge from experts to decide which controller to use.
Content of the invention
The object of the invention: in view of the characteristics of adaptive PID control, an adaptive PID control method based on parallel-optimization actor-critic learning (A3C) is proposed for the control of industrial systems. The invention better solves the problems of conventional adaptive PID control and, by exploiting the multi-threaded parallel learning characteristic of A3C, improves the stability and learning efficiency of the algorithm. The A3C-based adaptive PID controller has the advantages of fast response, strong adaptive ability and strong disturbance rejection.
The adaptive PID control method based on parallel-optimization actor-critic learning comprises the following steps:
Step S1: using MATLAB (a commercial mathematics software produced by MathWorks, USA), define a continuous transfer function of arbitrary order for the controlled system and discretize it by the zero-order-hold method to obtain a discretized transfer function with a user-defined sampling interval; initialize the controller parameters and M control threads for parallel learning, where the parameters mainly include the BP neural network parameters and the PID control environment parameters, and each thread is an independent control agent;
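To make step S1 concrete, the following Python sketch discretizes a continuous transfer function with a zero-order hold and prepares M worker threads; it is an illustrative assumption, not the patent's MATLAB implementation, and the plant polynomial and thread count used here are placeholders.

```python
# Illustrative sketch of step S1 (not the patent's MATLAB code): discretize a
# continuous transfer function with a zero-order hold and set up M worker threads.
# The plant G(s) = 1/(s^3 + 3s^2 + 3s + 1) and M = 4 are placeholder assumptions.
import threading
from scipy.signal import cont2discrete

num, den = [1.0], [1.0, 3.0, 3.0, 1.0]      # continuous transfer function of arbitrary order
dt = 0.001                                   # user-defined sampling interval (s)
num_d, den_d, _ = cont2discrete((num, den), dt, method='zoh')
num_d, den_d = num_d.ravel(), den_d.ravel()  # discretized numerator/denominator coefficients

M = 4                                        # number of parallel control threads (agents)

def worker(thread_id):
    # each thread holds its own controller and environment and learns in parallel
    print(f"agent {thread_id} ready, plant order {len(den_d) - 1}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(M)]
for t in threads: t.start()
for t in threads: t.join()
```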
Step S2: after step S1, initialize the BP neural network weights and the controlled plant of the PID controller; define a discrete input signal RIN, pass the discretized input signal to the discretized transfer function sample by sample at the defined time interval, compute the output value of the transfer function, and take the difference between the input and output signals as the input vector x(t) of the A3C adaptive PID control algorithm;
Step S3: pass the input vector x(t) obtained in step S2 to the constructed A3C adaptive PID controller for iterative training; a trained model is obtained after N iterations;
Step S31: compute the current error e(t), the first-order error Δe(t) and the second-order error Δ²e(t) as the input vector of the algorithm, x(t) = [e(t), Δe(t), Δ²e(t)]^T, and normalize it with the sigmoid function;
Step S32: pass the input vector to the Actor network of each thread and obtain the new PID parameters. The Actor network does not output the PID parameter values directly; instead it outputs the Gaussian distribution (mean and variance) of each of the three PID parameters, and the three parameter values are estimated from these Gaussian distributions. For output nodes o = 1, 2, 3 the output layer outputs the means of the PID parameters; for o = 4, 5, 6 it outputs the variances. The Actor network is a BP neural network with 3 layers in total: the 1st layer is the input layer; the 2nd layer is the hidden layer, with input hi_k(t) and output ho_k(t) = min(max(hi_k(t), 0), 6), k = 1, 2, 3, ..., 20; the 3rd layer is the output layer, whose input and output are computed from the hidden-layer outputs.
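A minimal numpy sketch of the Actor forward pass of step S32 (a 3-20-6 BP network with the clipped hidden activation) follows; the weight initialization and the softplus used to keep the variance outputs positive are assumptions for illustration, since the patent only specifies the hidden-layer activation.

```python
# Illustrative sketch of the Actor network of step S32 (3-20-6 BP network).
# Weight initialization and the softplus on the variance outputs are assumptions;
# the patent only specifies the clipped hidden activation min(max(hi, 0), 6).
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(20, 3))    # input layer -> hidden layer
W2 = rng.normal(scale=0.1, size=(6, 20))    # hidden layer -> output layer

def actor_forward(x):
    hi = W1 @ x                                   # hidden-layer input hi_k(t)
    ho = np.minimum(np.maximum(hi, 0.0), 6.0)     # ho_k(t) = min(max(hi_k(t), 0), 6)
    out = W2 @ ho                                 # output-layer input
    mu = out[:3]                                  # o = 1..3: means of P, I, D
    sigma = np.log1p(np.exp(out[3:])) + 1e-4      # o = 4..6: variances (softplus, assumed)
    return mu, sigma

mu, sigma = actor_forward(np.array([0.73, 0.73, 0.73]))
kp, ki, kd = rng.normal(mu, np.sqrt(sigma))       # sample the three PID parameters
```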
Step S33: assign the new PID parameters to the controller, obtain the control output, compute the control error, compute the reward value according to the environment reward function R(t) = α_1·r_1(t) + α_2·r_2(t), and obtain the vector value x′(t) of the next state;
Step S34: pass the reward R(t), the current state vector x(t) and the next state vector x′(t) to the Critic network. The Critic network has a structure similar to the Actor network, except that it has only one output node. The Critic network mainly outputs the state value and computes the TD error, δ_TD = r(t) + γ·V(S_{t+1}, W_v′) − V(S_t, W_v′);
Step S35: after the TD error is computed, each Actor-Critic network in the A3C structure does not directly update its own network weights; instead it uses its own gradients to update the Actor-Critic network parameters stored in the global network (Global-net). The update rule is W_a = W_a + α_a·dW_a, W_v = W_v + α_c·dW_v, where W_a is the Actor network weight stored in the global network, W_a′ is the weight of the Actor network of each AC structure, W_v is the Critic network weight stored in the global network, W_v′ is the Critic network weight of each AC structure, α_a is the learning rate of the Actor and α_c is the learning rate of the Critic. After the update the global network passes the newest parameters back to each AC structure;
Step S36: the above completes one training pass; iterate the loop N times, then exit training and save the model.
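The sketch below illustrates, in simplified form, the core updates of steps S34-S36: the TD error δ_TD = r(t) + γ·V(S_{t+1}) − V(S_t) and the push of a worker's gradients into the global network by W ← W + α·dW. The gradient terms dW_a and dW_v are placeholders, since the patent's explicit gradient formulas are not reproduced here.

```python
# Simplified sketch of the A3C update of steps S34-S36; dW_a and dW_v stand in
# for the actor/critic gradients, whose explicit formulas are not reproduced.
import numpy as np

gamma, alpha_a, alpha_c = 0.9, 0.001, 0.01   # discount and learning rates (0.001/0.01 from the embodiment)

def td_error(r_t, v_t, v_next):
    # delta_TD = r(t) + gamma * V(S_{t+1}, W_v') - V(S_t, W_v')
    return r_t + gamma * v_next - v_t

class GlobalNet:
    def __init__(self, Wa, Wv):
        self.Wa, self.Wv = Wa, Wv            # Actor/Critic weights stored centrally
    def apply_gradients(self, dWa, dWv):
        # each worker pushes its own gradients; the global weights are updated and
        # then copied back to that worker (asynchronously, in no fixed order)
        self.Wa = self.Wa + alpha_a * dWa    # W_a = W_a + alpha_a * dW_a
        self.Wv = self.Wv + alpha_c * dWv    # W_v = W_v + alpha_c * dW_v
        return self.Wa.copy(), self.Wv.copy()

g = GlobalNet(np.zeros((6, 20)), np.zeros((1, 20)))
delta = td_error(r_t=0.5, v_t=0.2, v_next=0.3)
local_Wa, local_Wv = g.apply_gradients(dWa=delta * np.ones((6, 20)),
                                       dWv=delta * np.ones((1, 20)))
```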
Step S4: run control tests using the trained model and record the input signal, the output signal and the changes of the PID parameters;
Step S41: feed the input signal defined in step S1 to the control model of the thread with the highest reward during training;
Step S42: after S41, compute the current, first-order and second-order errors as the input vector and feed it to the selected control model; unlike in training, only the PID parameter adjustments output by the Actor network are needed, and the adjusted PID parameters are passed to the controller to obtain the controller output;
Step S43: save the input signal, the output signal and the PID parameter changes obtained in step S42.
Step S5: use MATLAB to visualize the experimental data obtained in step S4, including the controller input signal, the output signal and the changes of the PID parameters, and compare the control performance with fuzzy adaptive PID control and AC-PID adaptive PID control.
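A small matplotlib sketch of the kind of visualization meant in step S5 follows; the signal names rin, yourt, u and the P, I, D traces come from the embodiment, while the arrays themselves are dummy data standing in for the logged test records.

```python
# Illustrative visualization of the recorded test data of step S5; the arrays are
# assumed to have been logged during the control test (here they are dummy data).
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(1000) * 0.001                      # 1000 control steps of 0.001 s
rin = np.ones_like(t)                            # step reference
yourt = 1.0 - np.exp(-t / 0.05)                  # dummy plant output for the sketch
kp = np.full_like(t, 0.8); ki = np.full_like(t, 0.2); kd = np.full_like(t, 0.05)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(t, rin, label="rin"); ax1.plot(t, yourt, label="yourt"); ax1.legend()
ax2.plot(t, kp, label="P"); ax2.plot(t, ki, label="I"); ax2.plot(t, kd, label="D"); ax2.legend()
ax2.set_xlabel("time (s)")
plt.show()
```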
Brief description of the drawings
Figure 1 is a schematic flow chart of the method of the invention.
Figure 2 is the structure diagram of the improved adaptive PID controller.
Figure 3 shows the output signal of the improved controller with a step signal as input.
Figure 4 shows the control quantity of the improved controller.
Figure 5 shows the control error of the improved adaptive PID controller.
Figure 6 shows the parameter adjustment curves of the A3C adaptive PID controller.
Figure 7 compares the improved controller with the fuzzy and AC-structure adaptive PID controllers.
Figure 8 shows the control experiment comparison and analysis of the different controllers.
Embodiment
The invention is further described below with reference to Figures 1-5 of the accompanying drawings and MATLAB software. The adaptive PID control based on parallel-optimization actor-critic learning comprises the following concrete steps, as shown in Figure 1:
(1) Parameter initialization. A third-order transfer function is selected as the controlled system and the discretization time is set to 0.001 s; the transfer function discretized by the Z-transform is yourt(k) = −den(2)·yourt(k−1) − den(3)·yourt(k−2) − den(4)·yourt(k−3) + num(2)·u(k−1) + num(3)·u(k−2) + num(4)·u(k−3). The input signal is a step signal of value 1.0, a single training episode is 1000 steps (1.0 s), and 4 threads representing 4 independent adaptive PID controllers are initialized and trained.
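The difference equation of embodiment (1) can be written as a short plant-update function; the coefficient values below are placeholders rather than the patent's plant, and the index k−3 on the last output term is taken as intended for a third-order system.

```python
# Illustrative implementation of the discretized plant of embodiment (1):
# yourt(k) = -den(2)*yourt(k-1) - den(3)*yourt(k-2) - den(4)*yourt(k-3)
#            + num(2)*u(k-1) + num(3)*u(k-2) + num(4)*u(k-3)
# The coefficient values below are placeholders, not the patent's plant.
den = [1.0, -2.85, 2.71, -0.86]      # den(1..4), with den(1) normalized to 1
num = [0.0, 1.6e-7, 6.4e-7, 1.6e-7]  # num(1..4)

def plant_step(y_hist, u_hist):
    """y_hist = [y(k-1), y(k-2), y(k-3)], u_hist = [u(k-1), u(k-2), u(k-3)]."""
    return (-den[1] * y_hist[0] - den[2] * y_hist[1] - den[3] * y_hist[2]
            + num[1] * u_hist[0] + num[2] * u_hist[1] + num[3] * u_hist[2])

y_hist, u_hist = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
u = 1.0                                         # control quantity at step k-1 (example)
y_next = plant_step(y_hist, [u] + u_hist[:2])   # plant output at step k
```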
(2) Compute the input vector. At t = 0, e(t) = rin(0) − yourt(0) = 1.0, e(t−1) = 0 and e(t−2) = 0. The input vector is x(t) = [e(t), Δe(t), Δ²e(t)]^T, where e(t) = rin − yourt = 1.0, Δe(t) = e(t) − e(t−1) = 1.0 and Δ²e(t) = e(t) − 2·e(t−1) + e(t−2) = 1.0, so the computed x(t) = [1.0, 1.0, 1.0]^T. After normalization by the sigmoid function the final input vector is x(t) = [0.73, 0.73, 0.73]^T.
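The normalization of embodiment (2) can be checked in a few lines: sigmoid(1.0) ≈ 0.731, which matches the value stated above.

```python
# Worked check of embodiment (2): build x(t) from e(t), e(t-1), e(t-2) and
# normalize it with the sigmoid function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

e_t, e_t1, e_t2 = 1.0, 0.0, 0.0                          # e(t), e(t-1), e(t-2) at t = 0
x = np.array([e_t, e_t - e_t1, e_t - 2 * e_t1 + e_t2])   # [e, Δe, Δ²e] = [1.0, 1.0, 1.0]
x_norm = sigmoid(x)                                      # ≈ [0.731, 0.731, 0.731]
```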
(3) Train the model. The structure of the improved adaptive PID controller is shown in Figure 2. After the state vector is computed it is first passed to the Actor network, which outputs the means μ and variances σ of the three parameters P, I and D. The actual values of P, I and D are drawn by Gaussian sampling and assigned to the incremental PID controller, and the controller computes the control quantity u(t) from the error and the new PID parameters:
u(t) = u(t−1) + Δu(t) = u(t−1) + K_I(t)·e(t) + K_P(t)·Δe(t) + K_D(t)·Δ²e(t)
The control quantity acts on the discretized transfer function, and the output signal value yourt(t+1), the error and the state vector of the next time instant t+1 are computed according to the procedure of (1). In addition, the environment reward function computes the reward of the control agent from the error; the reward function is
R(t) = α_1·r_1(t) + α_2·r_2(t)
where α_1 = 0.6 and α_2 = 0.4, with an error tolerance of 0.001.
The reward function is an important component of reinforcement learning. After the reward is received, the reward and the state vector of the next time instant are passed to the Critic network, which outputs the state values at times t and t+1 and computes the TD error by the formula δ_TD = r(t) + γ·V(S_{t+1}, W_v′) − V(S_t, W_v′), where W_v′ is the Critic network weight. Because the threads do not run synchronously, the controllers update the Actor network and Critic network parameters stored in the Global Net of Figure 2 in no fixed order; the update formulas are W_a = W_a + α_a·dW_a and W_v = W_v + α_c·dW_v, where W_a is the Actor network weight stored in the global network, W_a′ is the weight of the Actor network of each AC structure, W_v is the Critic network weight stored in the global network, W_v′ is the Critic network weight of each AC structure, α_a = 0.001 is the learning rate of the Actor and α_c = 0.01 is the learning rate of the Critic. This completes one training pass; after 3000 iterations the algorithm reaches a steady state.
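The sketch below puts the pieces of embodiment (3) together for a single control step: Gaussian sampling of P, I, D from the Actor outputs and the incremental PID law u(t) = u(t−1) + K_I·e + K_P·Δe + K_D·Δ²e, followed by the weighted reward R(t) = 0.6·r_1 + 0.4·r_2. The definitions of r_1 and r_2 are placeholder assumptions, since the patent's explicit reward terms are not reproduced here.

```python
# Illustrative single control step of embodiment (3); r1 and r2 are placeholder
# reward terms (based on the absolute error and its decrease), not the patent's.
import numpy as np

rng = np.random.default_rng(1)

def control_step(mu, sigma, u_prev, e, de, d2e):
    kp, ki, kd = rng.normal(mu, np.sqrt(sigma))      # Gaussian sampling of P, I, D
    du = ki * e + kp * de + kd * d2e                 # incremental PID adjustment Δu(t)
    u = u_prev + du                                  # u(t) = u(t-1) + Δu(t)
    return u, (kp, ki, kd)

def reward(e, e_prev, a1=0.6, a2=0.4):
    r1 = -abs(e)                                     # placeholder r1: penalize current error
    r2 = abs(e_prev) - abs(e)                        # placeholder r2: reward error decrease
    return a1 * r1 + a2 * r2                         # R(t) = alpha1*r1(t) + alpha2*r2(t)

u, gains = control_step(mu=np.array([0.8, 0.2, 0.05]), sigma=np.array([0.01] * 3),
                        u_prev=0.0, e=1.0, de=1.0, d2e=1.0)
R = reward(e=0.6, e_prev=1.0)
```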
(4) Collect experimental data. Using the trained controller model: because 4 threads were used for training, the thread with the highest cumulative reward is chosen as the test controller. A control test is run with the control parameters set in (1); the control duration is 1 s, i.e. 1000 control steps. The state vector is computed according to the method of (2) and passed to the trained model. During the control test the Critic network no longer works and the Actor outputs the P, I, D parameter values; the values of yourt, rin, u, P, I and D during the test are saved for visual analysis.
(5) Data visualization. The data saved in (4) are visualized and analysed with the MATLAB visualization tools. As shown in Figure 3, which plots the output value yourt, the controller reaches the steady state in less than 0.2 s and has fast regulating ability. Figure 4 shows the control quantity output by the controller, from which it can be seen that the controller reaches the steady state quickly. Figure 5 shows the control error of the controller, where the control error equals the input signal minus the output signal. Figure 6 shows the changes of the P, I and D parameters of the controller; it can be seen that the three parameters are adjusted to different degrees before the system stabilizes, and after stabilization the parameters no longer change. Using the same controlled plant and input signal, experimental comparisons are made with the fuzzy adaptive PID controller and the Actor-Critic adaptive PID controller; the signal output comparison of the three controllers is shown in Figure 7 and the detailed control analysis in Figure 8. As shown in Figure 8, while requiring no extensive expert prior knowledge, the controller of the invention achieves smaller overshoot than the fuzzy controller together with a faster response, and learns faster than the AC-PID controller while keeping a clear advantage in overshoot and response speed.
The invention aims to solve the problems of conventional adaptive PID controllers: fuzzy adaptive PID and expert adaptive PID controllers need a large amount of expert knowledge, and the teacher signal of a neural-network adaptive PID controller is difficult to obtain. Because the A3C learning structure is a reinforcement-learning algorithm with model-free on-line learning ability, it needs neither extensive expert prior knowledge nor a teacher signal, and thereby solves the problems of fuzzy, expert and neural-network adaptive PID controllers. Moreover, because the algorithm learns in parallel on multiple CPU threads, it greatly increases the learning rate of the AC-PID controller and achieves better control performance. The specific control performance is shown in Figure 7, which compares three controllers under identical parameters (the fuzzy adaptive PID controller, the AC-PID controller and the A3C-PID controller of the invention), and the detailed analysis is shown in Figure 8: while requiring no extensive expert prior knowledge, the controller of the invention achieves smaller overshoot than the fuzzy controller with a faster response, and learns faster than the AC-PID controller while keeping a clear advantage in overshoot and response speed.
The invention is not limited to the above embodiment. According to the common technical knowledge and customary means of the art, and without departing from the basic technical idea of the invention described above, equivalent modifications, replacements or changes of other various forms can also be made, and all of them fall within the protection scope of the invention.
Claims (3)
1. A parallel-optimization reinforcement learning adaptive PID control method, characterised by comprising the following steps:
Step S1: using MATLAB software, define a continuous transfer function of arbitrary order for the controlled system and discretize it by the zero-order-hold method to obtain a discretized transfer function with a user-defined sampling interval; initialize the controller parameters and M control threads for parallel learning, where the parameters mainly include the BP neural network parameters and the PID control environment parameters, and each thread is an independent control agent;
Step S2: after initializing the BP neural network weights and the controlled plant of the PID controller, define a discrete input signal RIN, pass the discretized input signal to the discretized transfer function sample by sample at the defined time interval, compute the output value of the transfer function, and take the difference between the input and output signals as the input vector x(t) of the A3C adaptive PID control algorithm;
Step S3: pass the input vector x(t) obtained in step S2 to the constructed A3C adaptive PID controller for iterative training; a trained model is obtained after N iterations;
Step S4: run control tests using the trained model and record the input signal, the output signal and the changes of the PID parameters;
Step S5: use MATLAB to visualize the experimental data obtained in step S4, including the controller input signal, the output signal and the changes of the PID parameters, and compare the control performance with fuzzy adaptive PID control and AC-PID adaptive PID control.
2. The parallel-optimization reinforcement learning adaptive PID control method according to claim 1, characterised in that step S3 comprises the following steps:
Step S31: compute the current error e(t), the first-order error Δe(t) and the second-order error Δ²e(t) as the input vector of the algorithm, x(t) = [e(t), Δe(t), Δ²e(t)]^T, and normalize it with the sigmoid function;
Step S32: pass the input vector to the Actor network of each thread and obtain the new PID parameters; the Actor network does not output the PID parameter values directly but outputs the Gaussian distribution (mean and variance) of each of the three PID parameters, and the three parameter values are estimated from these Gaussian distributions; for output nodes o = 1, 2, 3 the output layer outputs the means of the PID parameters, and for o = 4, 5, 6 it outputs the variances; the Actor network is a BP neural network with 3 layers in total: the 1st layer is the input layer; the 2nd layer is the hidden layer, with input hi_k(t) and output ho_k(t) = min(max(hi_k(t), 0), 6), k = 1, 2, 3, ..., 20; the 3rd layer is the output layer;
Step S33: assign the new PID parameters to the controller, obtain the control output, compute the control error, compute the reward value according to the environment reward function R(t) = α_1·r_1(t) + α_2·r_2(t), and obtain the vector value x′(t) of the next state;
Step S34: pass the reward R(t), the current state vector x(t) and the next state vector x′(t) to the Critic network; the Critic network has a structure similar to the Actor network except that it has only one output node; the Critic network mainly outputs the state value and computes the TD error, δ_TD = r(t) + γ·V(S_{t+1}, W_v′) − V(S_t, W_v′);
Step S35: after the TD error is computed, each Actor-Critic network in the A3C structure does not directly update its own network weights but uses its own gradients to update the Actor-Critic network parameters stored in the global network (Global-net); the update rule is W_a = W_a + α_a·dW_a, W_v = W_v + α_c·dW_v, where W_a is the Actor network weight stored in the global network, W_a′ is the weight of the Actor network of each AC structure, W_v is the Critic network weight stored in the global network, W_v′ is the Critic network weight of each AC structure, α_a is the learning rate of the Actor and α_c is the learning rate of the Critic; after the update the global network passes the newest parameters to each AC structure;
Step S36: the above completes one training pass; iterate the loop N times, then exit training and save the model.
3. The parallel-optimization reinforcement learning adaptive PID control method according to claim 1, characterised in that step S4 comprises the following steps:
Step S41: feed the input signal defined in step S1 to the control model of the thread with the highest reward during training;
Step S42: after S41, compute the current, first-order and second-order errors as the input vector and feed it to the selected control model; unlike in training, only the PID parameter adjustments output by the Actor network are needed, and the adjusted PID parameters are passed to the controller to obtain the controller output;
Step S43: save the input signal, the output signal and the PID parameter changes obtained in step S42.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711325553.4A CN108008627B (en) | 2017-12-13 | 2017-12-13 | Parallel optimization reinforcement learning self-adaptive PID control method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711325553.4A CN108008627B (en) | 2017-12-13 | 2017-12-13 | Parallel optimization reinforcement learning self-adaptive PID control method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108008627A true CN108008627A (en) | 2018-05-08 |
CN108008627B CN108008627B (en) | 2022-10-28 |
Family
ID=62058629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711325553.4A Active CN108008627B (en) | 2017-12-13 | 2017-12-13 | Parallel optimization reinforcement learning self-adaptive PID control method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108008627B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107346138A (en) * | 2017-06-16 | 2017-11-14 | 武汉理工大学 | A kind of unmanned boat method for lateral control based on enhancing learning algorithm |
CN108803348A (en) * | 2018-08-03 | 2018-11-13 | 北京深度奇点科技有限公司 | A kind of optimization method of pid parameter and the optimization device of pid parameter |
CN109063823A (en) * | 2018-07-24 | 2018-12-21 | 北京工业大学 | A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D |
CN109521669A (en) * | 2018-11-12 | 2019-03-26 | 中国航空工业集团公司北京航空精密机械研究所 | A kind of turning table control methods of self-tuning based on intensified learning |
CN109696830A (en) * | 2019-01-31 | 2019-04-30 | 天津大学 | The reinforcement learning adaptive control method of small-sized depopulated helicopter |
CN110308655A (en) * | 2019-07-02 | 2019-10-08 | 西安交通大学 | Servo system compensation method based on A3C algorithm |
CN110376879A (en) * | 2019-08-16 | 2019-10-25 | 哈尔滨工业大学(深圳) | A kind of PID type iterative learning control method neural network based |
CN111079936A (en) * | 2019-11-06 | 2020-04-28 | 中国科学院自动化研究所 | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning |
CN111856920A (en) * | 2020-07-24 | 2020-10-30 | 重庆红江机械有限责任公司 | A3C-PID-based self-adaptive rail pressure adjusting method and storage medium |
CN112162861A (en) * | 2020-09-29 | 2021-01-01 | 广州虎牙科技有限公司 | Thread allocation method and device, computer equipment and storage medium |
CN112631120A (en) * | 2019-10-09 | 2021-04-09 | Oppo广东移动通信有限公司 | PID control method, device and video coding and decoding system |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102588129A (en) * | 2012-02-07 | 2012-07-18 | 上海艾铭思汽车控制系统有限公司 | Optimization cooperative control method for discharge of nitrogen oxides and particles of high-pressure common-rail diesel |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102588129A (en) * | 2012-02-07 | 2012-07-18 | 上海艾铭思汽车控制系统有限公司 | Optimization cooperative control method for discharge of nitrogen oxides and particles of high-pressure common-rail diesel |
Non-Patent Citations (5)
Title |
---|
WANG XUE-SONG等: "A Proposal of Adaptive PID Controller Based on Reinforcement Learning", 《JOURNAL OF CHINA UNIVERSITY OF MINING & TECHNOLOGY》 * |
- ZHANG CHAO ET AL.: "Simulation of welding robot based on AC-PID controller", 《WELDING TECHNOLOGY》 *
- LIN XIAOFENG ET AL.: "Multi-objective action-dependent heuristic dynamic programming excitation control", 《PROCEEDINGS OF THE CSU-EPSA》 *
- CHEN XUESONG: "Research on reinforcement learning and its application in robot systems", 《CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE (INFORMATION SCIENCE AND TECHNOLOGY)》 *
- CHEN XUESONG ET AL.: "Adaptive PID control based on actor-critic learning", 《CONTROL THEORY & APPLICATIONS》 *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107346138B (en) * | 2017-06-16 | 2020-05-05 | 武汉理工大学 | Unmanned ship lateral control method based on reinforcement learning algorithm |
CN107346138A (en) * | 2017-06-16 | 2017-11-14 | 武汉理工大学 | A kind of unmanned boat method for lateral control based on enhancing learning algorithm |
CN109063823A (en) * | 2018-07-24 | 2018-12-21 | 北京工业大学 | A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D |
CN109063823B (en) * | 2018-07-24 | 2022-06-07 | 北京工业大学 | Batch A3C reinforcement learning method for exploring 3D maze by intelligent agent |
CN108803348A (en) * | 2018-08-03 | 2018-11-13 | 北京深度奇点科技有限公司 | A kind of optimization method of pid parameter and the optimization device of pid parameter |
CN108803348B (en) * | 2018-08-03 | 2021-07-13 | 北京深度奇点科技有限公司 | PID parameter optimization method and PID parameter optimization device |
CN109521669A (en) * | 2018-11-12 | 2019-03-26 | 中国航空工业集团公司北京航空精密机械研究所 | A kind of turning table control methods of self-tuning based on intensified learning |
CN109696830A (en) * | 2019-01-31 | 2019-04-30 | 天津大学 | The reinforcement learning adaptive control method of small-sized depopulated helicopter |
CN109696830B (en) * | 2019-01-31 | 2021-12-03 | 天津大学 | Reinforced learning self-adaptive control method of small unmanned helicopter |
CN110308655A (en) * | 2019-07-02 | 2019-10-08 | 西安交通大学 | Servo system compensation method based on A3C algorithm |
CN110376879A (en) * | 2019-08-16 | 2019-10-25 | 哈尔滨工业大学(深圳) | A kind of PID type iterative learning control method neural network based |
CN112631120A (en) * | 2019-10-09 | 2021-04-09 | Oppo广东移动通信有限公司 | PID control method, device and video coding and decoding system |
WO2021068748A1 (en) * | 2019-10-09 | 2021-04-15 | Oppo广东移动通信有限公司 | Pid control method and apparatus, and video encoding and decoding system |
CN112631120B (en) * | 2019-10-09 | 2022-05-17 | Oppo广东移动通信有限公司 | PID control method, device and video coding and decoding system |
CN111079936A (en) * | 2019-11-06 | 2020-04-28 | 中国科学院自动化研究所 | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning |
CN111079936B (en) * | 2019-11-06 | 2023-03-14 | 中国科学院自动化研究所 | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning |
CN111856920A (en) * | 2020-07-24 | 2020-10-30 | 重庆红江机械有限责任公司 | A3C-PID-based self-adaptive rail pressure adjusting method and storage medium |
CN112162861A (en) * | 2020-09-29 | 2021-01-01 | 广州虎牙科技有限公司 | Thread allocation method and device, computer equipment and storage medium |
CN112162861B (en) * | 2020-09-29 | 2024-04-19 | 广州虎牙科技有限公司 | Thread allocation method, thread allocation device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108008627B (en) | 2022-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108008627A (en) | A kind of reinforcement learning adaptive PID control method of parallel optimization | |
Ahamed et al. | A reinforcement learning approach to automatic generation control | |
CN108284442B (en) | Mechanical arm flexible joint control method based on fuzzy neural network | |
Wang | Intelligent critic control with robustness guarantee of disturbed nonlinear plants | |
DE69717987T2 (en) | METHOD AND DEVICE FOR SIMULATING DYNAMIC AND STATIONARY PREDICTION, REGULATION AND OPTIMIZATION METHODS | |
Song et al. | Neural-network-based synchronous iteration learning method for multi-player zero-sum games | |
Koryakovskiy et al. | Model-plant mismatch compensation using reinforcement learning | |
CN110134165B (en) | Reinforced learning method and system for environmental monitoring and control | |
Song et al. | Online optimal event-triggered H∞ control for nonlinear systems with constrained state and input | |
Radac et al. | Three-level hierarchical model-free learning approach to trajectory tracking control | |
EP3704550B1 (en) | Generation of a control system for a target system | |
CN101390024A (en) | Operation control method, operation control device and operation control system | |
CN115167102A (en) | Reinforced learning self-adaptive PID control method based on parallel dominant motion evaluation | |
Li et al. | Training a robust reinforcement learning controller for the uncertain system based on policy gradient method | |
Kumar et al. | Lyapunov stability-based control and identification of nonlinear dynamical systems using adaptive dynamic programming | |
Wang et al. | Asynchronous learning for actor–critic neural networks and synchronous triggering for multiplayer system | |
Bayramoglu et al. | Time-varying sliding-coefficient-based decoupled terminal sliding-mode control for a class of fourth-order systems | |
Hager et al. | Adaptive Neural network control of a helicopter system with optimal observer and actor-critic design | |
Ornelas-Tellez et al. | Neural networks: A methodology for modeling and control design of dynamical systems | |
CN117970782B (en) | Fuzzy PID control method based on fish scale evolution GSOM improvement | |
Eqra et al. | A novel adaptive multi-critic based separated-states neuro-fuzzy controller: Architecture and application to chaos control | |
US11164077B2 (en) | Randomized reinforcement learning for control of complex systems | |
Gupta et al. | Modified grey wolf optimised adaptive super-twisting sliding mode control of rotary inverted pendulum system | |
CN105279978B (en) | Intersection traffic signal control method and equipment | |
JP7327569B1 (en) | Information processing equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||