
CN117332693A - Slope stability evaluation method based on DDPG-PSO-BP algorithm - Google Patents

Slope stability evaluation method based on DDPG-PSO-BP algorithm Download PDF

Info

Publication number
CN117332693A
CN117332693A (application CN202311337010.XA)
Authority
CN
China
Prior art keywords
pso
algorithm
network
ddpg
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311337010.XA
Other languages
Chinese (zh)
Inventor
秦浩东
李文荣
杨跃光
张晓宸
张彬
王敩青
廖玉琴
毛强
张怿宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Corp Ultra High Voltage Transmission Co Electric Power Research Institute
Original Assignee
China Southern Power Grid Corp Ultra High Voltage Transmission Co Electric Power Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Southern Power Grid Corp Ultra High Voltage Transmission Co Electric Power Research Institute filed Critical China Southern Power Grid Corp Ultra High Voltage Transmission Co Electric Power Research Institute
Priority to CN202311337010.XA priority Critical patent/CN117332693A/en
Publication of CN117332693A publication Critical patent/CN117332693A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a slope stability evaluation method based on a DDPG-PSO-BP algorithm, comprising the following steps: selecting slope data samples; constructing a BP neural network with one or more hidden layers from the slope data samples; initializing the parameters of the particle swarm optimization (PSO) algorithm and the learning factors c_1 and c_2 of the particles; performing iterative PSO-BP training and updating the positions and velocities of all particles in the swarm; outputting the trained PSO-BP algorithm model and passing the prediction error, G_best and the maximum iteration number as state information into a deep reinforcement learning DDPG algorithm model, which outputs action information according to the state information and updates the learning factors c_1 and c_2; and returning the updated c_1 and c_2 to the PSO-BP algorithm model and using the finally trained DDPG-PSO-BP algorithm model to evaluate slope stability. By optimizing the learning factors of the particle swarm algorithm with the deep reinforcement learning DDPG algorithm, the invention improves the variation trend of the PSO-BP algorithm parameters during iteration, so that the trained model achieves higher prediction accuracy.

Description

Slope stability evaluation method based on DDPG-PSO-BP algorithm
Technical Field
The invention relates to the technical field of slope stability prediction, in particular to a slope stability evaluation method based on a DDPG-PSO-BP algorithm.
Background
Evaluation and prediction of slope stability are core tasks of slope engineering. Existing methods include the limit equilibrium method, the elastoplastic theory method and others; these methods have rigorous procedures and mature theory and are widely used in practical engineering. However, the factors influencing slope stability are numerous, the selection of rock and soil mass parameters involves a certain degree of subjectivity, and slope stability prediction exhibits large uncertainty and strong nonlinearity, so it cannot be expressed by an exact mathematical model or formula.
The BP neural network has simple initial parameter selection and strong self-adaptive and fault-tolerance capabilities for handling complex nonlinear relations, but it tends to fall into local minima and converges slowly; these problems can be alleviated by the particle swarm optimization algorithm.
In the particle swarm optimization algorithm, the learning factors are very important parameters and strongly influence the convergence process and results. When the learning factors are small, the particles move slowly through the search space and pay more attention to their individual optimal solutions, so convergence is slow; when the learning factors are large, the particle velocities change greatly and the particles pay more attention to the global optimal solution, which increases the possibility that the algorithm oscillates and skips over the global optimum. In the conventional particle swarm optimization algorithm, however, researchers usually use fixed values, which greatly limits the search speed and effectiveness of the algorithm.
Therefore, a new slope stability evaluation and prediction model that can optimize the learning factors of the particle swarm algorithm needs to be established to improve the accuracy and reliability of evaluation and prediction.
Disclosure of Invention
In view of the shortcomings of the prior art, the main purpose of the invention is to provide a slope stability evaluation method based on a DDPG-PSO-BP algorithm, so as to solve one or more problems in the prior art.
The technical scheme of the invention is as follows:
a slope stability evaluation method based on a DDPG-PSO-BP algorithm comprises the following steps:
S1: selecting slope data samples, and converting the sample data type;
S2: determining the structure of the BP neural network, and constructing a BP neural network with one or more hidden layers according to the slope data samples;
S3: integrating the PSO algorithm into the BP neural network, initializing the parameters of the particle swarm algorithm according to the established BP neural network, and initializing the learning factors c_1 and c_2 of the particles;
S4: performing iterative PSO-BP training, updating the individual optimal solution vector P_best of each particle and the global optimal solution vector G_best of the swarm by comparing fitness values, and updating the positions and velocities of all particles in the swarm;
S5: when the PSO-BP algorithm reaches a preset end condition, passing the prediction error, G_best and the maximum iteration number as state information into a deep reinforcement learning DDPG algorithm model, outputting action information according to the state information, and updating the learning factors c_1 and c_2;
S6: returning the updated learning factors c_1 and c_2 to the PSO-BP algorithm model, updating it to obtain new velocities and positions, obtaining the finally trained DDPG-PSO-BP algorithm model, and evaluating slope stability.
In some embodiments, converting the sample data type includes:
normalizing the slope data samples:
X' = a + (b - a)·(X - X_min) / (X_max - X_min)
where X and X' are the values of a slope data sample before and after the calculation; X_max and X_min are the maximum and minimum values of each column of the input and output sample data; and a, b are constants.
In some embodiments, in S2, determining the structure of the BP neural network and constructing the BP neural network with one or more hidden layers according to the slope data samples includes:
initializing the network parameters, and setting the numbers of neurons in the input, hidden and output layers of the BP neural network;
determining the numbers of input-layer and output-layer nodes of the BP neural network according to the slope sample data, and selecting the number of hidden-layer nodes according to the formula:
m = √(l + n) + α
where l, m and n are the numbers of nodes in the input, hidden and output layers of the BP neural network, respectively, α is an adjusting constant, and α = 1, 2, 3, ..., 10;
setting the activation function, with Sigmoid selected as the activation function:
f(x) = 1 / (1 + e^(-x))
where e^(-x) is the exponential function of the natural constant e;
selecting the mean square error function as the loss function:
E = (1/n) Σ_{i=1}^{n} (y_i - t_i)^2
where E is the total error, n is the number of samples, i indexes the data, y_i is the network output value and t_i is the label value.
In some embodiments, in S3, initializing the parameters of the particle swarm algorithm according to the established BP neural network includes:
determining the population size, maximum number of iterations, position range and velocity range, and initializing the learning factors c_1 and c_2 and the position and velocity vectors;
constructing a single-particle network and a particle-swarm network, comprising:
(1) Selecting the spatial dimension D:
D = l×m + m×n + m + n
where l, m and n are the numbers of nodes in the input, hidden and output layers of the BP neural network, respectively;
(2) In the D-dimensional space, expressing the position X_i and velocity V_i of the i-th particle as:
X_i = (x_i1, x_i2, ..., x_iD), i ∈ [1, 2, ..., N]
V_i = (v_i1, v_i2, ..., v_iD), i ∈ [1, 2, ..., N]
where x_i1, x_i2, ..., x_iD are the components of the position vector of particle i in the D-dimensional space at a given iteration, and v_i1, v_i2, ..., v_iD are the components of its velocity vector; the position and velocity of each particle are limited by the maximum position X_max and maximum velocity V_max, with X_i ∈ [-X_max, X_max] and V_i ∈ [-V_max, V_max].
In some embodiments, in S4, updating the positions and velocities of all particles in the particle swarm includes:
during each PSO-BP iteration, substituting the position of a particle into the BP neural network to obtain a predicted value;
substituting the predicted value into the fitness function to obtain the fitness value of the particle, and comparing fitness values to find the individual optimal position P_best of each particle in the space and the global optimal value G_best of the swarm;
updating the individual optimal solution vector P_best and the global optimal solution vector G_best of the swarm according to the back-propagation capability of the BP neural network, and updating the positions and velocities of all particles in the swarm.
In some embodiments, the fitness function is the mean square error function, and the fitness value of each particle is calculated according to the mean square error formula:
E = (1/n) Σ_{i=1}^{n} (y_i - t_i)^2
where E is the total error, n is the number of samples, i indexes the data, y_i is the network output value and t_i is the label value.
In some embodiments, finding the individual optimal position P_best of each particle and the global optimal value G_best of the swarm includes:
each particle independently searching for its own optimal position in the space up to the current iteration step, the best position found, i.e. the individual extremum, being denoted P_best:
P_best = (p_i1, p_i2, ..., p_iD), i ∈ [1, 2, ..., N]
where p_i1, p_i2, ..., p_iD are the components of the historical optimal position of particle i, i.e. the best solution found by the i-th particle after a given number of iterations;
in the whole particle swarm, the global optimal position reached by all particles during the search so far, i.e. the global optimum, being denoted G_best:
G_best = (p_g1, p_g2, ..., p_gD)
where p_g1, p_g2, ..., p_gD are the components of the historical optimal position of the swarm g, i.e. the best solution over the whole particle swarm after a given number of iterations;
all particles in the swarm adjusting their velocities and positions according to their individual extrema and the global optimum, using the update formulas:
v_{i+1} = ω·v_i + c_1·r_1·(P_best - x_i) + c_2·r_2·(G_best - x_i)
x_{i+1} = x_i + v_{i+1}
where ω is the inertia weight, decreasing linearly with the iteration number; v_{i+1} is the next velocity vector and v_i the current velocity vector; x_{i+1} is the next position vector and x_i the current position vector; c_1 and c_2 are the learning factors; r_1 and r_2 are random numbers in the range [0, 1]; ω_max is the maximum inertia weight, ω_min the minimum inertia weight, t the iteration step, and t_max the maximum number of iteration steps.
In some embodiments, in S5, the preset end condition is reaching the target convergence accuracy or the maximum number of iterations, the target convergence accuracy being judged by the mean square error; if the condition is met the calculation terminates, otherwise the iteration count is incremented by 1 and the algorithm returns to the previous step.
In some embodiments, in S5, the DDPG algorithm model comprises a Critic network Q(·|θ^Q), an Actor network μ(·|θ^μ), a Target Critic network Q'(·|θ^Q') and a Target Actor network μ'(·|θ^μ'), wherein
the Critic network update process comprises:
calculating the action in state s' with the Target Actor network:
a' = μ'(s'|θ^μ')
where a' is the action of the Target Actor network μ'(·|θ^μ') in state s';
calculating the target value of the state-action pair (s, a) with the Target Critic network:
y = r + γ(1 - done)·Q'(s', a'|θ^Q')
where y is the target value, r the immediate reward, γ the discount factor, done the task completion flag, Q'(·|θ^Q') the Target Critic network and s' the next state;
calculating the evaluation value q of the state-action pair (s, a) with the Critic network;
minimizing the difference L_c between the evaluation value and the target value by gradient descent, thereby updating the parameters of the Critic network:
L_c = (y - q)^2
where y is the target value and q the predicted value;
the Actor network update process comprises: calculating the action a in state s with the Actor network:
a = μ(s|θ^μ)
where μ(·|θ^μ) is the Actor network and s the state;
calculating the evaluation value q of the state-action pair (s, a) with the Critic network:
q = Q(s, a|θ^Q)
where Q(·|θ^Q) is the Critic network and a the action in state s;
finally, maximizing the accumulated expected return by gradient ascent, thereby updating the parameters of the Actor network;
the Target Critic network being updated as:
θ^Q' = τ·θ^Q + (1 - τ)·θ^Q'
where θ^Q is a parameter of the Critic network and θ^Q' a parameter of the Target Critic network;
the Target Actor network being updated as:
θ^μ' = τ·θ^μ + (1 - τ)·θ^μ'
where τ is the update weight, θ^μ is a parameter of the Actor network and θ^μ' a parameter of the Target Actor network.
In some embodiments, in S6, the maximum number of iterations of the DDPG algorithm model is preset; when the DDPG algorithm model reaches the maximum round, the learning factors c_1 and c_2 obtained from the convergence of the DDPG algorithm model are returned to the PSO-BP algorithm model, which is updated to obtain the new positions and velocities of all particles in the swarm, i.e. the optimal weights and biases of the BP neural network, yielding the finally trained DDPG-PSO-BP algorithm model.
In some embodiments, further comprising: s7: benchmark test verification specifically includes:
(1) The verification method comprises the following steps: selecting a plurality of groups of side slope data, wherein one part of the side slope data is used as a training sample to train and learn the BP neural network, and the other part of the side slope data is used as a test sample to test the feasibility of the DDPG-PSO-BP algorithm model;
(2) Prediction accuracy: selecting a mean square error function as a prediction error;
(3) Comparing and analyzing the prediction results of the PSO-BP algorithm model and the DDPG-PSO-BP algorithm model.
Compared with the prior art, the invention has the beneficial effects that: the invention provides a slope stability evaluation method based on a DDPG-PSO-BP algorithm, which optimizes learning factors in a particle swarm algorithm by using a DDPG algorithm of deep reinforcement learning.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those skilled in the art from this disclosure that the drawings described below are merely exemplary and that other embodiments may be derived from the drawings provided without undue effort.
The structures, proportions, sizes, etc. shown in the present specification are provided only for the purposes of illustration and description and are not intended to limit the scope of the invention, which is defined by the claims.
FIG. 1 is a schematic flow chart of a slope stability evaluation method based on a DDPG-PSO-BP algorithm according to some embodiments of the present invention;
FIG. 2 is a plot of return value versus learning round number for the DDPG algorithm model;
FIG. 3 is a schematic diagram showing the comparison of fitness values of a PSO-BP algorithm model and a DDPG-PSO-BP algorithm model;
FIG. 4 is a schematic diagram showing the comparison of the prediction results of the PSO-BP algorithm model and the DDPG-PSO-BP algorithm model;
FIG. 5 is a schematic diagram showing the correlation between the PSO-BP algorithm model prediction result and test data;
FIG. 6 is a schematic diagram showing correlation between the model prediction result of the DDPG-PSO-BP algorithm and test data.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the embodiments and the accompanying drawings. The exemplary embodiments of the present invention and their descriptions herein are for the purpose of explaining the present invention, but are not to be construed as limiting the invention.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
It should be understood that the terms "comprises/comprising," "consists of …," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product, apparatus, process, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product, apparatus, process, or method as desired. Without further limitation, an element defined by the phrases "comprising/including …," "consisting of …," and the like, does not exclude the presence of other like elements in a product, apparatus, process, or method that includes the element.
It is further understood that the terms "upper," "lower," "front," "rear," "left," "right," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship based on that shown in the drawings, merely to facilitate describing the present invention and to simplify the description, and do not indicate or imply that the devices, components, or structures referred to must have a particular orientation, be configured or operated in a particular orientation, and are not to be construed as limiting the present invention.
Deep Deterministic Policy Gradient (DDPG) algorithm is an online deep reinforcement learning algorithm specially solving the problem of continuous control, and has the advantages of high convergence rate, direct optimization of continuous action and the like.
According to the invention, the DDPG algorithm is introduced into the PSO-BP algorithm, and the learning factors in the particle swarm algorithm are optimized through the deep reinforcement learning DDPG algorithm, so that the variation trend of the PSO-BP algorithm parameters in the iterative process is improved and the trained model predicts slope stability with higher accuracy.
The implementation of the present invention will be described in detail with reference to the preferred embodiments.
The invention provides a slope stability evaluation method based on a DDPG-PSO-BP algorithm, which mainly comprises the following steps. S1: selecting slope data samples and converting the sample data type. S2: determining the structure of the BP neural network and constructing a BP neural network with one or more hidden layers from the slope data samples. S3: integrating the PSO algorithm into the BP neural network, and initializing the parameters of the particle swarm algorithm, including the learning factors c_1 and c_2 of the particles, according to the established BP neural network. S4: performing iterative PSO-BP training, updating the individual optimal solution vector P_best of each particle and the global optimal solution vector G_best of the swarm by comparing fitness values, and updating the positions and velocities of all particles in the swarm. S5: when the PSO-BP algorithm reaches the preset end condition, outputting the trained PSO-BP algorithm model, passing the prediction error, G_best and the maximum iteration number as state information into the deep reinforcement learning DDPG algorithm model, outputting action information according to the state information, and updating the learning factors c_1 and c_2. S6: returning the updated learning factors c_1 and c_2 to the trained PSO-BP algorithm model and updating it to obtain the DDPG-PSO-BP algorithm model, which is then used to evaluate slope stability.
The slope stability evaluation method provided by the invention can accurately and effectively predict and evaluate the safety of the slope, and is beneficial to finding and solving potential problems in advance, so that the safety and stability of the slope are improved.
Specifically, in S1, slope data samples are selected and the sample data type is converted, eliminating the differences in order of magnitude and dimension between the groups of data.
In the invention, six geological-condition parameters that influence slope stability are selected: the unit weight of the soil, cohesion, internal friction angle, slope angle, slope height and pore water pressure.
The unit weight of the soil is the weight of soil per unit volume and directly influences the sliding force of the slope; other conditions being equal, the greater the sliding force, the higher the risk of instability. Cohesion is the shear strength of the failure surface under zero normal stress. The internal friction angle reflects the friction and interlocking between soil particles as they move against each other; cohesion and the internal friction angle are the most important factors affecting slope stability. The slope angle is the angle between the slope surface and the horizontal plane. The slope height is the vertical height from the slope crest to the horizontal plane at the slope angle. Pore water pressure is the pressure of groundwater in the soil or rock; it acts between particles or in the pores and influences the soil weight. All of these are important parameters affecting slope stability.
By using these parameters to predict slope stability, the current safety condition of the slope can be judged accurately, which facilitates timely treatment and maintenance.
Furthermore, the invention normalizes the slope data samples before training:
X' = a + (b - a)·(X - X_min) / (X_max - X_min)
where X and X' are the values of a slope data sample before and after the calculation; X_max and X_min are the maximum and minimum values of each column of the input and output sample data; and a, b are constants.
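As a concrete illustration of this preprocessing step, the following Python sketch applies column-wise min-max normalization under the assumption that the constants a and b define the target range (here defaulted to [0, 1]); the function name and array layout are illustrative and not taken from the patent.

```python
import numpy as np

def normalize_columns(X, a=0.0, b=1.0):
    """Min-max normalize each column of the slope sample matrix into [a, b].

    X: 2-D array, rows are slope samples, columns are the geological factors.
    a, b: constants defining the target range (assumed interpretation).
    """
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)                                      # per-column minimum
    x_max = X.max(axis=0)                                      # per-column maximum
    span = np.where(x_max - x_min == 0, 1.0, x_max - x_min)    # avoid division by zero
    return a + (b - a) * (X - x_min) / span
```

For example, normalize_columns(samples) would map each input column (unit weight, cohesion, internal friction angle, slope angle, slope height, pore pressure) into [0, 1] before training.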
S2, determining the structure of the BP neural network, and constructing the BP neural network with single or multiple hidden layers according to the slope data sample comprises the following steps:
initializing network parameters, and setting the number of neurons of an input layer, a hidden layer and an output layer of the BP neural network.
In this embodiment, the geological condition factors affecting slope stability (soil unit weight, cohesion, internal friction angle, slope angle, slope height and pore water pressure) are taken as input values of the BP neural network after normalized data preprocessing, and the slope safety factor is taken as the output value.
The numbers of input-layer and output-layer nodes of the BP neural network are determined from the slope sample data, and the number of hidden-layer nodes is selected according to the formula:
m = √(l + n) + α
where l, m and n are the numbers of nodes in the input, hidden and output layers of the BP neural network, respectively, and α is an adjusting constant with α = 1, 2, 3, ..., 10.
In this embodiment, l is selected according to the number of input data, n is selected according to the number of output data, and α is selected according to the error in the iterative process.
An activation function is set; the invention selects the most widely used Sigmoid function as the activation function:
f(x) = 1 / (1 + e^(-x))
where e^(-x) is the exponential function of the natural constant e.
In this embodiment, the mean square error (MSE) function is selected as the loss function:
E = (1/n) Σ_{i=1}^{n} (y_i - t_i)^2
where E is the total error, n is the number of samples, i indexes the data, y_i is the network output value and t_i is the label value.
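A minimal NumPy sketch of this network structure is given below, showing the hidden-node rule, the Sigmoid activation and the MSE loss described above; the weight-initialization scheme, the linear output layer and all function names are assumptions for illustration rather than details prescribed by the patent.

```python
import numpy as np

def hidden_nodes(l, n, alpha=3):
    """Empirical rule m = sqrt(l + n) + alpha, with alpha in 1..10."""
    return int(np.sqrt(l + n)) + alpha

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_bp(l, m, n, rng=np.random.default_rng(0)):
    """Random initial weights and biases for a single-hidden-layer BP network (assumed scheme)."""
    return {
        "W1": rng.normal(scale=0.1, size=(l, m)), "b1": np.zeros(m),
        "W2": rng.normal(scale=0.1, size=(m, n)), "b2": np.zeros(n),
    }

def forward(params, X):
    """Forward pass: input -> Sigmoid hidden layer -> linear output (assumed output layer)."""
    h = sigmoid(X @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def mse(y, t):
    """Loss E = (1/n) * sum((y_i - t_i)^2)."""
    return float(np.mean((np.asarray(y) - np.asarray(t)) ** 2))
```

With six input factors and one output (the safety factor), one would use l = 6, n = 1 and m = hidden_nodes(6, 1, alpha).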
In S3, the PSO algorithm is fused into the BP neural network, each particle representing a network, and initializing the parameters of the particle swarm algorithm includes:
determining the population size N, the position range and the velocity range; initializing the positions X_i and velocities V_i of the particles; setting the iteration count t, the maximum number of iterations t_max and the inertia weight ω; and initializing the learning factors c_1 and c_2.
Further, a single-particle network and a particle-swarm network are established according to the parameters of the BP neural network used to initialize the PSO algorithm, and a suitable spatial dimension D is selected. Assuming a group of N particles in a D-dimensional space, the dimension D is determined by:
D = l×m + m×n + m + n
where l, m and n are the numbers of nodes in the input, hidden and output layers of the neural network, respectively.
Further, in the D-dimensional space, the position X_i and velocity V_i of the i-th particle can be expressed as:
X_i = (x_i1, x_i2, ..., x_iD), i ∈ [1, 2, ..., N]
V_i = (v_i1, v_i2, ..., v_iD), i ∈ [1, 2, ..., N]
In this embodiment the position and velocity of each particle are limited by the maximum position X_max and maximum velocity V_max, with X_i ∈ [-X_max, X_max] and V_i ∈ [-V_max, V_max], to avoid blind searching of the space.
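A small sketch of this particle encoding follows, assuming each particle's position vector is simply the flattened weights and biases of the BP network; the bound values x_max and v_max are placeholders, not values from the patent.

```python
import numpy as np

def particle_dimension(l, m, n):
    """D = l*m + m*n + m + n: all weights and biases of the BP network."""
    return l * m + m * n + m + n

def init_swarm(N, D, x_max=1.0, v_max=0.1, rng=np.random.default_rng(0)):
    """Initialize N particles with positions in [-x_max, x_max] and velocities in [-v_max, v_max]."""
    X = rng.uniform(-x_max, x_max, size=(N, D))
    V = rng.uniform(-v_max, v_max, size=(N, D))
    return X, V

def decode(position, l, m, n):
    """Unpack one position vector into the BP network's weight matrices and bias vectors."""
    i = 0
    W1 = position[i:i + l * m].reshape(l, m); i += l * m
    W2 = position[i:i + m * n].reshape(m, n); i += m * n
    b1 = position[i:i + m]; i += m
    b2 = position[i:i + n]
    return W1, b1, W2, b2
```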
In S4, iterative PSO-BP training is performed and the positions and velocities of all particles in the swarm are updated, which includes:
during each PSO-BP iteration, substituting the position of a particle into the BP neural network to obtain a predicted value;
substituting the predicted value into the fitness function to obtain the fitness value of the particle, and comparing fitness values to find the individual optimal position P_best of each particle in the space and the global optimal value G_best of the swarm;
updating the individual optimal solution vector P_best and the global optimal solution vector G_best according to the back-propagation adjustment of weights and thresholds in the BP neural network, and updating the positions and velocities of all particles in the swarm.
Further, each particle independently searches for its own optimal position in the space up to the current iteration step; the best position found, i.e. the individual extremum, is denoted P_best:
P_best = (p_i1, p_i2, ..., p_iD), i ∈ [1, 2, ..., N]
where p_i1, p_i2, ..., p_iD are the components of the historical optimal position of particle i, i.e. the best solution found by the i-th particle (individual) after a given number of iterations.
Each particle shares information with the other particles of the swarm; in the whole particle swarm, the global optimal position reached by all particles during the search so far, i.e. the global optimum, is denoted G_best:
G_best = (p_g1, p_g2, ..., p_gD)
where p_g1, p_g2, ..., p_gD are the components of the historical optimal position of the swarm g, i.e. the best solution over the whole particle swarm after a given number of iterations.
Further, all particles in the swarm adjust their velocities and positions according to their individual extrema and the global optimum, using the update formulas:
v_{i+1} = ω·v_i + c_1·r_1·(P_best - x_i) + c_2·r_2·(G_best - x_i)
x_{i+1} = x_i + v_{i+1}
where ω is the inertia weight, which decreases linearly with the iteration number:
ω = ω_max - (ω_max - ω_min)·t / t_max
In these formulas, v_{i+1} is the next velocity vector and v_i the current velocity vector; x_{i+1} is the next position vector and x_i the current position vector; c_1 and c_2 are collectively called the learning factors, the former being the empirical coefficient of the individual particle and the latter the empirical coefficient of the particle swarm; r_1 and r_2 are random numbers in the range [0, 1]; ω_max is the maximum inertia weight, ω_min the minimum inertia weight, t the iteration step, and t_max the maximum number of iteration steps.
Further, all particles in the swarm are traversed, the slope sample data are input for network propagation, and the fitness value of each particle is calculated with the mean square error formula:
E = (1/n) Σ_{i=1}^{n} (y_i - t_i)^2
where E is the total error, n is the number of samples, i indexes the data, y_i is the network output value and t_i is the label value.
In this embodiment, the current fitness value of a particle is compared with that particle's previous individual best: if the current value is better, the current position is assigned to P_best; otherwise the single-particle network is kept according to the original individual best. The current fitness value of the particle is then compared with the best value of all particles in the population: if the current value is better, the position of the current particle is assigned to G_best; otherwise the particle-swarm network is kept according to the global historical best. Finally the current P_best is compared with G_best: if P_best is better, the particle's current P_best is assigned to G_best; otherwise the particle-swarm network is updated according to the original G_best.
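A minimal sketch of one PSO iteration is given below, following the velocity/position update, the linearly decreasing inertia weight ω = ω_max - (ω_max - ω_min)·t/t_max and the P_best/G_best comparison logic described above; the fitness function is passed in as a callable, and all parameter values shown are illustrative assumptions.

```python
import numpy as np

def pso_step(X, V, p_best, p_best_fit, g_best, g_best_fit, fitness, t, t_max,
             c1=2.0, c2=2.0, w_max=0.9, w_min=0.4, x_max=1.0, v_max=0.1,
             rng=np.random.default_rng(0)):
    """One PSO iteration: evaluate fitness, update P_best/G_best, then velocities and positions."""
    fit = np.array([fitness(x) for x in X])           # fitness (MSE) of each particle

    improved = fit < p_best_fit                        # update individual bests
    p_best[improved] = X[improved]
    p_best_fit[improved] = fit[improved]

    if fit.min() < g_best_fit:                         # update the global best
        g_best, g_best_fit = X[fit.argmin()].copy(), float(fit.min())

    w = w_max - (w_max - w_min) * t / t_max            # linearly decreasing inertia weight
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = w * V + c1 * r1 * (p_best - X) + c2 * r2 * (g_best - X)
    V = np.clip(V, -v_max, v_max)                      # enforce velocity bound
    X = np.clip(X + V, -x_max, x_max)                  # enforce position bound
    return X, V, p_best, p_best_fit, g_best, g_best_fit
```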
In S5, when the PSO-BP algorithm reaches the preset end condition, the prediction error, G_best and the maximum iteration number are passed as state information into the deep reinforcement learning DDPG algorithm model, which then outputs action information, namely the optimized learning factors c_1 and c_2, according to the state information.
In this embodiment, the preset end condition of the PSO-BP algorithm model is reaching the target convergence accuracy or the maximum number of iterations. The target convergence accuracy is judged by the mean square error: if it is reached, the calculation terminates; otherwise the iteration count is incremented by 1 and the algorithm returns to the previous step.
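This end-condition check can be written as a one-line helper; the target MSE and maximum-iteration values below are placeholders, not values specified by the patent.

```python
def pso_finished(mse_value, iteration, target_mse=1e-3, max_iter=200):
    """Preset end condition: target convergence accuracy (judged by MSE) or maximum iterations."""
    return mse_value <= target_mse or iteration >= max_iter
```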
Deep reinforcement learning mainly concerns how an agent interacts with an environment: the agent receives the state of the environment and takes an action according to some policy; the environment updates the state according to the agent's action and gives a reward; and the agent updates its policy according to the reward.
The optimization process of the learning factors in the PSO-BP algorithm model can be modeled as a Markov Decision Process (MDP), represented by the tuple <S, A, P, R, γ>.
Here S is the state space, namely the prediction error, G_best and the maximum iteration number in the present invention, representing the set of all possible states of the agent in the environment; A is the action space, namely the learning factors c_1 and c_2, representing the set of all possible actions the agent can take in the environment; P is the state-transition probability matrix, where P(s'|s, a) is the probability of transitioning to the new state s' after taking action a in state s; R is the reward function, designed in the present invention as r = k_1·(1/prediction error) + k_2·(1/iteration number), and R(s, a) is the immediate reward obtained after taking action a in state s; γ ∈ [0, 1] is the discount factor, representing the influence of future rewards on the current decision.
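The reward design above can be written directly as a small function; the weighting coefficients k_1 and k_2 are not specified in the patent, so the defaults here are placeholders.

```python
def reward(pred_error, n_iterations, k1=1.0, k2=1.0, eps=1e-8):
    """R = k1 * (1 / prediction error) + k2 * (1 / iteration number), as designed above.

    k1, k2 are assumed weighting coefficients; eps guards against division by zero.
    """
    return k1 / (pred_error + eps) + k2 / (n_iterations + eps)
```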
Further, deep reinforcement learning involves a state value function and an action value function. The state value function is the expected return obtained by following policy π from a given state s:
V_π(s) = E[G_t | S_t = s]
where E denotes the expectation, G_t the return and S_t the state.
V_π(s) measures the expected sum of rewards obtained by the agent from state s to the terminal state and thereby guides the update of the policy.
The action value function is the expected return obtained by taking action a in state s and then following policy π:
Q_π(s, a) = E[G_t | S_t = s, A_t = a]
where A_t is the action.
Q_π(s, a) measures the expected sum of rewards obtained by the agent from taking action a in state s until the terminal state.
In one round, the deep reinforcement learning agent obtains c_1 and c_2 from the state information of the PSO-BP algorithm model; the PSO-BP algorithm model predicts again with the current c_1 and c_2 to obtain a reward value, and the state, action and reward information are stored in the experience buffer; the DDPG algorithm model then extracts a batch of data from the experience buffer to update its networks. The DDPG algorithm model iterates in this loop until the maximum number of rounds is reached, yielding the converged optimal c_1 and c_2.
The experience buffer uses the experience replay technique. Experience replay keeps the distribution of training samples stable and improves training stability. It consists of two steps, "storage" and "replay": storage saves each experience in the form (s_t, a_t, r_{t+1}, s_{t+1}, done) in an experience pool; replay samples one or more pieces of experience data from the pool according to some rule.
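A minimal experience-replay buffer matching the (s_t, a_t, r_{t+1}, s_{t+1}, done) storage format described above is sketched below; the capacity and batch size are illustrative defaults.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool with uniform random sampling ("storage" and "replay")."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```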
The DDPG algorithm model in this step comprises four networks: a Critic network Q(·|θ^Q), an Actor network μ(·|θ^μ), a Target Critic network Q'(·|θ^Q') and a Target Actor network μ'(·|θ^μ'). The Critic network updates its parameters θ^Q by minimizing the error between the evaluation value and the target value, and the Actor network updates its parameters θ^μ by maximizing the accumulated expected return.
Further, the Critic network update process is as follows. The action in state s' is calculated with the Target Actor network:
a' = μ'(s'|θ^μ')
where a' is the action of the Target Actor network in state s'.
The target value of the state-action pair (s, a) is calculated with the Target Critic network:
y = r + γ(1 - done)·Q'(s', a'|θ^Q')
where y is the target value, r the immediate reward, γ the discount factor, done the task completion flag, Q'(·|θ^Q') the Target Critic network and s' the next state.
The evaluation value q of the state-action pair (s, a) is calculated with the Critic network.
The difference L_c between the evaluation value and the target value is minimized by gradient descent, thereby updating the parameters of the Critic network:
L_c = (y - q)^2
where y is the target value and q the predicted value.
Further, the Actor network update process is as follows. The action a in state s is calculated with the Actor network:
a = μ(s|θ^μ)
where μ(·|θ^μ) is the Actor network and s the state.
The evaluation value q of the state-action pair (s, a) is calculated with the Critic network:
q = Q(s, a|θ^Q)
where Q(·|θ^Q) is the Critic network and a the action in state s.
Finally, the accumulated expected return, i.e. the evaluation value q, is maximized by gradient ascent, thereby updating the parameters of the Actor network.
The Target Critic network is updated as:
θ^Q' = τ·θ^Q + (1 - τ)·θ^Q'
where θ^Q is a parameter of the Critic network and θ^Q' a parameter of the Target Critic network.
The Target Actor network is updated as:
θ^μ' = τ·θ^μ + (1 - τ)·θ^μ'
where τ is the update weight, θ^μ is a parameter of the Actor network and θ^μ' a parameter of the Target Actor network.
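The four-network update described above (Critic target y = r + γ(1 - done)·Q'(s', μ'(s')), Critic loss (y - q)^2, Actor gradient ascent on Q(s, μ(s)), and soft target updates θ' ← τθ + (1 - τ)θ') can be sketched in PyTorch as follows. The network sizes, learning rates, batch-tensor shapes and the 3-dimensional state / 2-dimensional action interpretation are illustrative assumptions, not values from the patent, and in practice the actor output would be scaled into a valid (c_1, c_2) range.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

state_dim, action_dim, gamma, tau = 3, 2, 0.99, 0.005   # assumed state summary and (c1, c2) action
actor, critic = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
target_actor, target_critic = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One DDPG update from a batch of transitions (tensors of shape [batch, dim] / [batch, 1])."""
    with torch.no_grad():
        a2 = target_actor(s2)                                                     # a' = mu'(s')
        y = r + gamma * (1 - done) * target_critic(torch.cat([s2, a2], dim=1))    # target value
    q = critic(torch.cat([s, a], dim=1))                                          # evaluation value q
    critic_loss = nn.functional.mse_loss(q, y)                                    # L_c = (y - q)^2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()                  # ascent on Q(s, mu(s))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, target in ((critic, target_critic), (actor, target_actor)):          # soft target updates
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```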
It is easy to understand that the DDPG algorithm in reinforcement learning is an online deep reinforcement learning algorithm under the Actor-Critic framework; in addition to the Actor and Critic networks, it uses a set of Target Actor and Target Critic networks for estimating the targets. Using the target networks reduces oscillation and instability of the estimated target values.
In S6, the maximum number of iterations of the DDPG algorithm model is preset. When the DDPG algorithm model reaches the maximum round, the learning factors c_1 and c_2 obtained from its convergence are returned to the PSO-BP algorithm model, which is updated to obtain the new positions and velocities of all particles in the swarm, i.e. the optimal weights and biases of the BP neural network, yielding the finally trained DDPG-PSO-BP algorithm model.
According to the invention, the finally trained DDPG-PSO-BP algorithm model is used to evaluate slope stability, overcoming the problems of the traditional PSO-BP algorithm model such as poor local search capability, slow convergence speed and low result accuracy.
The invention also includes S7: benchmark test verification, which specifically comprises the following.
(1) Verification method: to verify the prediction performance of the DDPG-PSO-BP algorithm model, the invention performs model training and data prediction on actual slope data. Several groups of slope data are selected; one part is used as training samples to train the BP neural network, and the other part is used as test samples to verify the feasibility of the DDPG-PSO-BP algorithm model.
In a specific embodiment, 85 groups of side slope data are selected in total, the whole sample data are divided into two parts, 70 groups of data are used as training samples, training learning is carried out on a network, and 15 groups of data are used as test samples for checking the feasibility of the network model constructed by the invention.
(2) Prediction accuracy: to evaluate the model performance more accurately, the mean square error (MSE) function is selected as the prediction error:
E = (1/n) Σ_{i=1}^{n} (y_i - t_i)^2
where E is the total error, n is the number of samples, i indexes the data, y_i is the network output value and t_i is the label value.
In a specific embodiment, accuracy prediction on the slope data is performed with both the PSO-BP algorithm model and the DDPG-PSO-BP algorithm model. In the reinforcement learning training stage, the return value is often used to judge the convergence and performance of the algorithm; as shown in FIG. 2, the return value increases with the number of rounds and finally converges and stabilizes at about 25 after roughly 600 rounds. As shown in FIG. 3, the prediction error on the test data is 0.0141 with the PSO-BP algorithm model and 0.0026 with the DDPG-PSO-BP algorithm model. The prediction accuracy of the DDPG-PSO-BP algorithm model is therefore higher.
(3) Comparing and analyzing the prediction results of the PSO-BP algorithm model and the DDPG-PSO-BP algorithm model.
In a specific embodiment, referring to FIGS. 4 to 6: FIG. 4 compares the prediction results of the traditional PSO-BP algorithm model and the DDPG-PSO-BP algorithm model, and the DDPG-PSO-BP curve is closer to the actual curve; FIG. 5 shows the correlation between the PSO-BP model predictions and the test data, with a correlation of 0.59938; FIG. 6 shows the correlation between the DDPG-PSO-BP model predictions and the test data, with a correlation of 0.92607. The predictions of the DDPG-PSO-BP algorithm model are therefore closer to the test data and better than those of the traditional PSO-BP algorithm model.
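The two figures of merit used in this comparison, the mean-square prediction error and the correlation between predictions and test data, can be computed as below; the array names are placeholders.

```python
import numpy as np

def evaluate(predictions, test_labels):
    """Return (MSE, Pearson correlation) between model predictions and the test data."""
    predictions = np.asarray(predictions, dtype=float)
    test_labels = np.asarray(test_labels, dtype=float)
    mse = float(np.mean((predictions - test_labels) ** 2))
    corr = float(np.corrcoef(predictions, test_labels)[0, 1])
    return mse, corr
```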
The invention provides a slope stability prediction method based on an improved DDPG-PSO-BP algorithm, which optimizes a learning factor in a particle swarm algorithm speed iteration formula by using a DDPG algorithm of deep reinforcement learning.
It is easy to understand by those skilled in the art that the above preferred embodiments can be freely combined and overlapped without conflict.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (11)

1. A slope stability evaluation method based on a DDPG-PSO-BP algorithm is characterized by comprising the following steps:
S1: selecting slope data samples, and converting the sample data type;
S2: determining the structure of the BP neural network, and constructing a BP neural network with one or more hidden layers according to the slope data samples;
S3: integrating the PSO algorithm into the BP neural network, initializing the parameters of the particle swarm algorithm according to the established BP neural network, and initializing the learning factors c_1 and c_2 of the particles;
S4: performing iterative PSO-BP training, updating the individual optimal solution vector P_best of each particle and the global optimal solution vector G_best of the swarm by comparing fitness values, and updating the positions and velocities of all particles in the swarm;
S5: when the PSO-BP algorithm reaches a preset end condition, passing the prediction error, G_best and the maximum iteration number as state information into a deep reinforcement learning DDPG algorithm model, outputting action information according to the state information, and updating the learning factors c_1 and c_2;
S6: returning the updated learning factors c_1 and c_2 to the PSO-BP algorithm model, updating it to obtain new velocities and positions, obtaining the finally trained DDPG-PSO-BP algorithm model, and evaluating slope stability.
2. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 1, wherein in S1, converting the sample data type includes:
normalizing the slope data samples:
X' = a + (b - a)·(X - X_min) / (X_max - X_min)
where X and X' are the values of a slope data sample before and after the calculation; X_max and X_min are the maximum and minimum values of each column of the input and output sample data; and a, b are constants.
3. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 1, wherein in S2, determining the structure of the BP neural network and constructing the BP neural network with one or more hidden layers according to the slope data samples comprises:
initializing the network parameters, and setting the numbers of neurons in the input, hidden and output layers of the BP neural network;
determining the numbers of input-layer and output-layer nodes of the BP neural network according to the slope sample data, and selecting the number of hidden-layer nodes according to the formula:
m = √(l + n) + α
where l, m and n are the numbers of nodes in the input, hidden and output layers of the BP neural network, respectively, α is an adjusting constant, and α = 1, 2, 3, ..., 10;
setting the activation function, with Sigmoid selected as the activation function:
f(x) = 1 / (1 + e^(-x))
where e^(-x) is the exponential function of the natural constant e;
selecting the mean square error function as the loss function:
E = (1/n) Σ_{i=1}^{n} (y_i - t_i)^2
where E is the total error, n is the number of samples, i indexes the data, y_i is the network output value and t_i is the label value.
4. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 1, wherein in S3, initializing the parameters of the particle swarm algorithm according to the established BP neural network includes:
determining the population size, maximum number of iterations, position range and velocity range, and initializing the learning factors c_1 and c_2 and the position and velocity vectors;
constructing a single-particle network and a particle-swarm network, comprising:
(1) Selecting the spatial dimension D:
D = l×m + m×n + m + n
where l, m and n are the numbers of nodes in the input, hidden and output layers of the BP neural network, respectively;
(2) In the D-dimensional space, expressing the position X_i and velocity V_i of the i-th particle as:
X_i = (x_i1, x_i2, ..., x_iD), i ∈ [1, 2, ..., N]
V_i = (v_i1, v_i2, ..., v_iD), i ∈ [1, 2, ..., N]
where x_i1, x_i2, ..., x_iD are the components of the position vector of particle i in the D-dimensional space at a given iteration, and v_i1, v_i2, ..., v_iD are the components of its velocity vector; the position and velocity of each particle are limited by the maximum position X_max and maximum velocity V_max, with X_i ∈ [-X_max, X_max] and V_i ∈ [-V_max, V_max].
5. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 4, wherein in S4, updating the positions and velocities of all particles in the particle swarm comprises:
during each PSO-BP iteration, substituting the position of a particle into the BP neural network to obtain a predicted value;
substituting the predicted value into the fitness function to obtain the fitness value of the particle, and comparing fitness values to find the individual optimal position P_best of each particle in the space and the global optimal value G_best of the swarm;
updating the individual optimal solution vector P_best and the global optimal solution vector G_best according to the back-propagation capability of the BP neural network, and updating the positions and velocities of all particles in the swarm.
6. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 5, wherein the fitness function is the mean square error function, and the fitness value of each particle is calculated according to the mean square error formula:
E = (1/n) Σ_{i=1}^{n} (y_i - t_i)^2
where E is the total error, n is the number of samples, i indexes the data, y_i is the network output value and t_i is the label value.
7. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 5, wherein finding the individual optimal position P_best of each particle and the global optimal value G_best of the swarm comprises:
each particle independently searching for its own optimal position in the space up to the current iteration step, the best position found, i.e. the individual extremum, being denoted P_best:
P_best = (p_i1, p_i2, ..., p_iD), i ∈ [1, 2, ..., N]
where p_i1, p_i2, ..., p_iD are the components of the historical optimal position of particle i, i.e. the best solution found by the i-th particle after a given number of iterations;
in the whole particle swarm, the global optimal position reached by all particles during the search so far, i.e. the global optimum, being denoted G_best:
G_best = (p_g1, p_g2, ..., p_gD)
where p_g1, p_g2, ..., p_gD are the components of the historical optimal position of the swarm g, i.e. the best solution over the whole particle swarm after a given number of iterations;
all particles in the swarm adjusting their velocities and positions according to their individual extrema and the global optimum, using the update formulas:
v_{i+1} = ω·v_i + c_1·r_1·(P_best - x_i) + c_2·r_2·(G_best - x_i)
x_{i+1} = x_i + v_{i+1}
where ω is the inertia weight, decreasing linearly with the iteration number; v_{i+1} is the next velocity vector and v_i the current velocity vector; x_{i+1} is the next position vector and x_i the current position vector; c_1 and c_2 are the learning factors; r_1 and r_2 are random numbers in the range [0, 1]; ω_max is the maximum inertia weight, ω_min the minimum inertia weight, t the iteration step, and t_max the maximum number of iteration steps.
8. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 1, wherein in S5 the preset end condition is reaching the target convergence accuracy or the maximum number of iterations, the target convergence accuracy being judged by the mean square error; if the condition is met the calculation terminates, otherwise the iteration count is incremented by 1 and the algorithm returns to the previous step.
9. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 1, wherein in S5 the DDPG algorithm model comprises a Critic network Q(·|θ^Q), an Actor network μ(·|θ^μ), a Target Critic network Q'(·|θ^Q') and a Target Actor network μ'(·|θ^μ'), wherein
the Critic network update process comprises:
calculating the action in state s' with the Target Actor network:
a' = μ'(s'|θ^μ')
where a' is the action of the Target Actor network μ'(·|θ^μ') in state s';
calculating the target value of the state-action pair (s, a) with the Target Critic network:
y = r + γ(1 - done)·Q'(s', a'|θ^Q')
where y is the target value, r the immediate reward, γ the discount factor, done the task completion flag, Q'(·|θ^Q') the Target Critic network and s' the next state;
calculating the evaluation value q of the state-action pair (s, a) with the Critic network;
minimizing the difference L_c between the evaluation value and the target value by gradient descent, thereby updating the parameters of the Critic network:
L_c = (y - q)^2
where y is the target value and q the predicted value;
the Actor network update process comprises: calculating the action a in state s with the Actor network:
a = μ(s|θ^μ)
where μ(·|θ^μ) is the Actor network and s the state;
calculating the evaluation value q of the state-action pair (s, a) with the Critic network:
q = Q(s, a|θ^Q)
where Q(·|θ^Q) is the Critic network and a the action in state s;
finally, maximizing the accumulated expected return by gradient ascent, thereby updating the parameters of the Actor network;
the Target Critic network being updated as:
θ^Q' = τ·θ^Q + (1 - τ)·θ^Q'
where θ^Q is a parameter of the Critic network and θ^Q' a parameter of the Target Critic network;
the Target Actor network being updated as:
θ^μ' = τ·θ^μ + (1 - τ)·θ^μ'
where τ is the update weight, θ^μ is a parameter of the Actor network and θ^μ' a parameter of the Target Actor network.
10. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 1, wherein in S6 the maximum number of iterations of the DDPG algorithm model is preset; when the DDPG algorithm model reaches the maximum round, the learning factors c_1 and c_2 obtained from the convergence of the DDPG algorithm model are returned to the PSO-BP algorithm model, which is updated to obtain the new positions and velocities of all particles in the swarm, i.e. the optimal weights and biases of the BP neural network, yielding the finally trained DDPG-PSO-BP algorithm model.
11. The slope stability evaluation method based on the DDPG-PSO-BP algorithm according to claim 1, further comprising: s7: benchmark test verification specifically includes:
(1) The verification method comprises the following steps: selecting a plurality of groups of side slope data, wherein one part of the side slope data is used as a training sample to train and learn the BP neural network, and the other part of the side slope data is used as a test sample to test the feasibility of the DDPG-PSO-BP algorithm model;
(2) Prediction accuracy: selecting a mean square error function as a prediction error;
(3) Comparing and analyzing the prediction results of the PSO-BP algorithm model and the DDPG-PSO-BP algorithm model.
CN202311337010.XA 2023-10-16 2023-10-16 Slope stability evaluation method based on DDPG-PSO-BP algorithm Pending CN117332693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311337010.XA CN117332693A (en) 2023-10-16 2023-10-16 Slope stability evaluation method based on DDPG-PSO-BP algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311337010.XA CN117332693A (en) 2023-10-16 2023-10-16 Slope stability evaluation method based on DDPG-PSO-BP algorithm

Publications (1)

Publication Number Publication Date
CN117332693A true CN117332693A (en) 2024-01-02

Family

ID=89278999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311337010.XA Pending CN117332693A (en) 2023-10-16 2023-10-16 Slope stability evaluation method based on DDPG-PSO-BP algorithm

Country Status (1)

Country Link
CN (1) CN117332693A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828403A (en) * 2024-01-03 2024-04-05 浙江丰源泵业有限公司 Water pump fault prediction and diagnosis method based on machine learning

Similar Documents

Publication Publication Date Title
CN109858647B (en) Regional flood disaster risk evaluation and estimation method coupled with GIS and GBDT algorithm
WO2022083009A1 (en) Customized product performance prediction method based on heterogeneous data error compensation fusion
CN109492814A (en) A kind of Forecast of Urban Traffic Flow prediction technique, system and electronic equipment
CN110806759A (en) Aircraft route tracking method based on deep reinforcement learning
CN103105246A (en) Greenhouse environment forecasting feedback method of back propagation (BP) neural network based on improvement of genetic algorithm
CN114692310B (en) Dueling DQN-based virtual-real fusion primary separation model parameter optimization method
CN113177675B (en) Air conditioner cooling load prediction method based on longicorn group algorithm optimization neural network
CN112633591B (en) Space searching method and device based on deep reinforcement learning
CN113983646A (en) Air conditioner interaction end energy consumption prediction method based on generation countermeasure network and air conditioner
CN117268391B (en) Intelligent planning method and system for deformed aircraft based on target layered architecture
CN116451556A (en) Construction method of concrete dam deformation observed quantity statistical model
CN115016534A (en) Unmanned aerial vehicle autonomous obstacle avoidance navigation method based on memory reinforcement learning
CN112464567A (en) Intelligent data assimilation method based on variational and assimilative framework
CN108832663A (en) The prediction technique and equipment of the generated output of micro-capacitance sensor photovoltaic generating system
CN117332693A (en) Slope stability evaluation method based on DDPG-PSO-BP algorithm
CN109754122A (en) A kind of Numerical Predicting Method of the BP neural network based on random forest feature extraction
CN112381282A (en) Photovoltaic power generation power prediction method based on width learning system
CN115374933A (en) Intelligent planning and decision-making method for landing behavior of multi-node detector
CN111008790A (en) Hydropower station group power generation electric scheduling rule extraction method
CN116680477A (en) Personalized problem recommendation method based on reinforcement learning
CN115964898A (en) Bignty game confrontation-oriented BC-QMIX on-line multi-agent behavior decision modeling method
CN113379063B (en) Whole-flow task time sequence intelligent decision-making method based on online reinforcement learning model
CN115938104A (en) Dynamic short-time road network traffic state prediction model and prediction method
CN116415700A (en) Wind energy numerical forecasting method and device combining artificial intelligence
CN118195853A (en) Intelligent campus information management method and system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination