Disclosure of Invention
In order to overcome the defects of the prior art, the method for predicting regional traffic distribution under a variable traffic control scheme based on the WMGIRL algorithm starts from the perspective of driving trajectories and comprehensively considers multiple aspects, such as the environment model and road attributes, so as to improve the accuracy of traffic distribution prediction.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme: the method for predicting regional traffic distribution under a variable traffic control scheme based on the WMGIRL algorithm comprises the following steps:
S1, modeling the urban road network of the area to be predicted, and collecting driving trajectory data within the area;
S2, extracting traffic features of the area to be predicted through a multi-weight maximum-entropy inverse reinforcement learning method, based on the urban road network and the driving trajectory data corresponding to the area;
S3, processing the urban road network with a forward reinforcement learning method, based on the extracted traffic features and the current traffic control scheme, to obtain the traffic distribution prediction result for the area to be predicted.
Further, in step S1, the method for modeling an urban road network for the area to be predicted specifically includes:
A1, in the area to be predicted, regarding each intersection as a vertex V and the road between two adjacent intersections as an edge E, to obtain the entity road network G(V, E);
A2, regarding the intersections in the entity road network G(V, E) as states of a Markov process and the roads as the behaviors required by the agent for state transitions, so that the entity road network G(V, E) is treated as the environment model;
A3, determining the environment matrix F of the environment model from the connectivity of the entity road network G(V, E), and determining the state transition matrix P from the attributes of the roads in G(V, E), thereby completing the urban road network modeling;
wherein, the environment matrix F is:
F = [(e_1, l_1), (e_2, l_2), ..., (e_f, l_f)]
where (e_f', l_f') ∈ F, e_f' is a road network node, l_f' is the number of motor vehicle lanes at node e_f', and the subscript f' is the node ordinal, f' = 1, 2, 3, ..., f.
The state transition matrix P is a three-dimensional matrix of size m × n × m built on the environment matrix, where m is the number of states and n is the number of behaviors; each element P_sas' of the state transition matrix P indicates the probability that taking behavior a in state s leads to the new state s', with P_sas' ∈ [0, 1].
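Purely as an illustration of these two data structures (the node names, lane counts and transition probabilities below are invented and not part of the claimed method), the environment matrix F and the m × n × m state transition matrix P could be laid out as follows:

```python
import numpy as np

# Toy example: 4 intersections (states) and up to 3 outgoing roads (behaviors)
# per intersection; all node names, lane counts and probabilities are invented.
m, n = 4, 3

# Environment matrix F: one (node, number-of-motor-vehicle-lanes) pair per node.
F = [("e1", 2), ("e2", 3), ("e3", 2), ("e4", 4)]

# State transition matrix P of size m x n x m; P[s, a, s_next] is the
# probability that taking behavior a in state s leads to state s_next.
P = np.zeros((m, n, m))
P[0, 0, 1] = 1.0      # from intersection e1, road 0 leads to e2
P[0, 1, 2] = 1.0      # from intersection e1, road 1 leads to e3
P[1, 0, 3] = 0.7      # an uncertain transition split between two successors
P[1, 0, 0] = 0.3

# Every (s, a) slice must stay a (sub-)probability distribution over s_next.
assert np.all(P.sum(axis=2) <= 1.0 + 1e-9)
print(P[1, 0])        # -> [0.3 0.  0.  0.7]
```

Indexing P[s, a, s'] directly mirrors the definition of P_sas' above.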
Further, the step S2 is specifically:
S21, constructing an expert strategy set T = {ξ_0, ξ_1, ..., ξ_m} based on the urban road network and the collected driving trajectory data;
where ξ_i ∈ T, ξ_i is an expert strategy, i = 1, 2, 3, ..., m; an expert strategy ξ_i is ξ_i = {(s_0, a_0), (s_1, a_1), ..., (s_n, a_n)}, with (s_t, a_t) ∈ ξ_i, (s_t, a_t) a state-behavior pair, s_t a state, a_t a behavior, and the subscript t the t-th time step, t = 1, 2, 3, ..., n.
S22, setting a feature expectation weight for the state s_t of each expert strategy ξ_i in the expert strategy set T, and calculating the feature expectation of each expert strategy;
S23, setting a feature weight coefficient x to account for the differing lengths of the expert strategies in the set, and initializing the feature weight coefficient x;
S24, determining the state return rate c from the feature weight coefficient x, and further obtaining the reward function and the state action value function p(s' | s, a);
S25, calculating the expected state visitation frequency through a forward-backward algorithm;
S26, updating the feature weight coefficients with gradient vectors based on the expected state visitation frequency, and repeating steps S24-S26 until convergence, then entering step S27;
S27, taking the current reward function as the optimal reward function, to obtain the traffic features.
Further, in step S22, the feature expectation weight of the state s_t of the expert strategy ξ_i is z_n;
in step S22, the feature expectation of the expert strategy ξ_i is the length-weighted, discounted sum of the rewards of its states:
μ(ξ_i) = l · Σ_t γ^t · R(s_t^i)
where l is the length weight of the expert strategy ξ_i, γ^t is the learning rate of the algorithm, R(s_t^i) is the reward function, i.e. the reward given to the state of the i-th trajectory at time t, and s_t^i is the state of the i-th trajectory at time t;
the length weight l of the expert strategy ξ_i is determined by len(ξ_i), where len(·) is the length of the expert strategy;
the state action value function p(s' | s, a) in step S24 gives the probability that agent i, taking behavior a in state s, reaches state s'.
In step S26, the feature weight coefficients are updated with the gradient vector of the maximum likelihood function L(θ): the gradient is the difference between the visitation frequency actually computed from the expert trajectories and the expected frequency of the pair (s_t, a_t), multiplied by the corresponding feature vector.
Further, the step S3 is specifically:
S31, constructing an environment model M based on the urban road network under the current traffic control scheme and the corresponding expert strategy set T = {ξ_0, ξ_1, ..., ξ_m};
S32, placing a plurality of agents in the constructed environment model, and providing each agent with three experience cache pools;
wherein the three experience cache pools of each agent are the priority cache pool R_priority, the profit cache pool R_success and the loss cache pool R_failure;
S33, each agent in turn selects a behavior a_t, which modifies the current agent's state s_t and yields a corresponding return value r_t, thereby determining the corresponding experience (s_t, a_t, r_t, s_t+1);
S34, storing the experience (s_t, a_t, r_t, s_t+1) together with its priority Y into the corresponding experience cache pool of the agent, according to the return value r_t;
S35, extracting experiences, in a set proportion, from the agent's own experience cache pools and from those of the agent whose state is closest to its state s_t, to form a training set, and training the agent's DDPG network structure;
S36, clustering the trained agents by their trajectory similarity D_s to group the agents;
S37, performing an intra-group evolution operation on each group of agents obtained by the grouping, to obtain several new agent subgroups and form the trajectory set of the current urban road network;
S38, counting the traffic of each road section in the urban road network based on the trajectory set of the current urban road network, to obtain the traffic distribution prediction.
Further, the step S34 is specifically:
R1, storing the experience (s_t, a_t, r_t, s_t+1) into the priority cache pool R_priority, computing the priority Y of the experience (s_t, a_t, r_t, s_t+1), and storing the priority Y into the priority cache pool R_priority;
R2, when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is positive, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the profit cache pool R_success;
when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is negative, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the loss cache pool R_failure.
Further, in step R1, the priority Y of the experience (s_t, a_t, r_t, s_t+1) is calculated as:
Y = A·z_t + B·δ_t
where A and B are error coefficients, z_t is the eligibility factor and δ_t is the TD error;
the eligibility factor z_t is determined from the discount coefficient γ, the trace-decay parameter λ and the return value obtained by taking behavior a_t in state s_t;
the TD error δ_t is:
δ_t = R_t + γ·Q'(s_t, argmax_a Q(s_t, a_t)) − Q(s_t-1, a_t-1)
where R_t is the reward obtained at time t and Q'(·) is the behavior selection function that maximizes the reward for the corresponding state.
Further, the step S36 is specifically:
B1, calculating the trajectory similarity D_s between the trajectories generated by the agents through a dynamic time warping algorithm;
B2, taking the similarity D_s of any two trajectories as the inter-class distance;
B3, dividing the agents whose inter-class distance is smaller than a set threshold into one group, thereby grouping the agents.
Further, in step S37, performing the intra-group evolution operation on each group of agents to obtain a new agent subgroup specifically includes:
C1, regarding all agents as a population pop, and encoding each agent;
C2, calculating the fitness of the agents in the population pop based on their encodings;
C3, initializing an empty population newpop, and adding the agent with the maximum fitness into newpop;
C4, selecting among the remaining agents in the population pop, adding the selected agents into the population newpop, and updating newpop;
C5, replacing the population pop with the updated population newpop, and decoding each agent in newpop to obtain the new agent subgroup.
Further, in step C1, the method for encoding each agent in the population pop includes:
performing real-number encoding on the weights of the agent's neural network, in the form:
Agent = [w_11 ... w_x'y' ... w_xy, w_11 ... w_y'q' ... w_yq]
where w_x'y' is the weight from input-layer neuron x' to hidden-layer neuron y' in the agent's neural network, and w_y'q' is the weight from hidden-layer neuron y' to output-layer neuron q';
in step C2, the fitness f(b) of agent b is:
f(b) = R_b + q·Σδ_b
where R_b is the total reward of agent b over each round, Σδ_b is the total loss value of agent b, and q is the learning rate;
the step C4 specifically includes:
C41, selecting agents by roulette-wheel selection among the remaining agents in the population pop;
C42, generating new agents from the selected agents according to the set crossover probability and mutation probability;
wherein the crossover probability P_c is:
P_c = 1 − 0.5 / (1 + e^Δf)
and the mutation probability P_m is:
P_m = 1 − 0.05 / (1 + e^Δf)
where Δf is the difference between the maximum fitness value and the average fitness value in the population pop;
C43, adding the generated new agents into the population newpop until the number of agents in newpop reaches the set number, completing the update of newpop.
The invention has the following beneficial effects:
(1) the method starts from the perspective of driving trajectories and comprehensively considers multiple aspects, such as the environment model and road attributes, to improve the accuracy of traffic distribution prediction;
(2) the method first models the road network and extracts the connection relations among road points as the basis for simplifying the road network; the subsequent simplification reduces the workload of the maximum entropy algorithm and avoids an oversized environment model, without destroying the link relations among the intersections in the road network;
(3) the method exploits the characteristic of inverse reinforcement learning that the return function of the environment is mined with the expert strategy set as the research object, and thereby mines the distinctive driving trajectory characteristics of urban residents;
(4) on the basis of the DDPG algorithm, the invention proposes a multi-experience-discrimination multi-agent reinforcement learning algorithm with intra-group evolution; through mechanisms such as information interaction and intra-group inheritance, multiple agents can quickly find the target in an unknown environment, and compared with traditional learning algorithms the optimal strategy in the environment can be found quickly and effectively.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those of ordinary skill in the art, various changes that fall within the spirit and scope of the invention as defined by the appended claims remain protected, and everything produced by making use of the inventive concept falls within the scope of protection.
Example 1:
As shown in fig. 1, the method for predicting regional traffic distribution under a variable traffic control scheme based on the WMGIRL algorithm includes the following steps:
S1, modeling the urban road network of the area to be predicted, and collecting driving trajectory data within the area;
S2, extracting traffic features of the area to be predicted through a multi-weight maximum-entropy inverse reinforcement learning method, based on the urban road network and the driving trajectory data corresponding to the area;
S3, processing the urban road network with a forward reinforcement learning method, based on the extracted traffic features and the current traffic control scheme, to obtain the traffic distribution prediction result for the area to be predicted.
In step S1, when modeling the urban road network, the things in the real world and their relations are reasonably abstracted, according to corresponding criteria, into a form composed of nodes and edges, and the corresponding entity network can be expressed as G(V, E). Based on this, the method for modeling the urban road network of the area to be predicted in step S1 specifically includes:
A1, in the area to be predicted, regarding each intersection as a vertex V and the road between two adjacent intersections as an edge E, to obtain the entity road network G(V, E);
where V denotes a finite set of nodes, V = {v_1, v_2, ..., v_m}, and E denotes a finite set of edges, E = {e_1, e_2, ..., e_n}, in which e = (v_i, v_j) denotes that an edge e is a directed ray determined by the two nodes v_i, v_j; for an unweighted network of n nodes, the graph is described by the adjacency matrix A(G) = [a_ij]_{n×n}, where a_ij = 1 if there is an edge from node v_i to node v_j and a_ij = 0 otherwise.
A2, regarding the intersections in the entity road network G(V, E) as states of a Markov process and the roads as the behaviors required by the agent for state transitions, so that the entity road network G(V, E) is treated as the environment model;
A3, determining the environment matrix F of the environment model from the connectivity of the entity road network G(V, E), and determining the state transition matrix P from the attributes of the roads in G(V, E), thereby completing the urban road network modeling;
wherein, the environment matrix F is:
F = [(e_1, l_1), (e_2, l_2), ..., (e_f, l_f)]
where (e_f', l_f') ∈ F, e_f' is a road network node, l_f' is the number of motor vehicle lanes at node e_f', and the subscript f' is the node ordinal, f' = 1, 2, 3, ..., f.
The state transition matrix P is a three-dimensional matrix of size m × n × m built on the environment matrix, where m is the number of states and n is the number of behaviors; each element P_sas' of the state transition matrix P indicates the probability that taking behavior a in state s leads to the new state s', with P_sas' ∈ [0, 1].
In the above urban road network modeling process, the degree k_i of a node e_i in such a network is usually defined as the number of nodes interconnected with it; a larger degree means that the node is more important. For a directed graph the degree of a node can generally be divided into two parts: the in-degree of node e_i is the number of nodes pointing to e_i, and the out-degree is the number of nodes that node e_i points to, so that k_i = k_i^in + k_i^out.
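As a small illustration of the degree relations just described (the adjacency values are arbitrary), the in-degree, out-degree and total degree can be read directly from an adjacency matrix A(G):

```python
import numpy as np

# Hypothetical directed, unweighted adjacency matrix A(G) for 4 nodes:
# A[i, j] = 1 if there is a road from node i to node j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])

out_degree = A.sum(axis=1)            # roads leaving each node
in_degree = A.sum(axis=0)             # roads arriving at each node
total_degree = in_degree + out_degree # k_i = k_i^in + k_i^out

print(in_degree, out_degree, total_degree)
```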
In general, in a complex network an edge can represent any association between nodes. For an urban traffic network, it is in practice not sufficient to express the association between adjacent intersections merely by whether a road section connects them and by its direction; at a finer granularity, lanes should also be divided into motor-vehicle lanes and non-motor-vehicle lanes, and the most important attributes of a lane include the actual number of lanes, the width, the road alignment, the gradient, etc.
Constrained by the research conditions, acquiring all parameters at the road network level is still difficult, so only the practical influence of the number of motor-vehicle lanes is considered first, while the other elements are gradually brought into the empirical analysis as the research advances; this is also common practice in road element analysis. Only the number-of-lanes attribute is therefore included in this study, and accordingly
E = {(e_1, l_1), (e_2, l_2), ..., (e_n, l_n)}
where l is the number of motor-vehicle lanes.
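A minimal sketch of how the entity road network G(V, E), its adjacency matrix and the lane-count attribute of the edge set E could be assembled from a road list; all road identifiers and lane numbers here are hypothetical:

```python
import numpy as np

# Hypothetical road list: (from intersection, to intersection, motor lanes).
road_list = [("v1", "v2", 2), ("v2", "v3", 3), ("v3", "v4", 2),
             ("v4", "v1", 1), ("v2", "v4", 4)]

nodes = sorted({v for a, b, _ in road_list for v in (a, b)})
index = {v: k for k, v in enumerate(nodes)}

# Adjacency matrix A(G) of the entity road network G(V, E).
A = np.zeros((len(nodes), len(nodes)), dtype=int)
for a, b, _ in road_list:
    A[index[a], index[b]] = 1

# Edge set E = {(e1, l1), ..., (en, ln)} keeping only the lane-count attribute.
E = {f"{a}->{b}": lanes for a, b, lanes in road_list}

print(A)
print(E["v2->v4"])   # -> 4
```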
The step S2 is specifically:
S21, constructing an expert strategy set T = {ξ_0, ξ_1, ..., ξ_m} based on the urban road network and the collected driving trajectory data;
where ξ_i ∈ T, ξ_i is an expert strategy, i = 1, 2, 3, ..., m; an expert strategy ξ_i is ξ_i = {(s_0, a_0), (s_1, a_1), ..., (s_n, a_n)}, with (s_t, a_t) ∈ ξ_i, (s_t, a_t) a state-behavior pair, s_t a state, a_t a behavior, and the subscript t the t-th time step, t = 1, 2, 3, ..., n;
S22, setting a feature expectation weight for the state s_t of each expert strategy ξ_i in the expert strategy set T, and calculating the feature expectation of each expert strategy;
S23, setting a feature weight coefficient x to account for the differing lengths of the expert strategies in the set, and initializing the feature weight coefficient x;
S24, determining the state return rate c from the feature weight coefficient x, and further obtaining the reward function and the state action value function p(s' | s, a);
S25, calculating the expected state visitation frequency through a forward-backward algorithm;
S26, updating the feature weight coefficients with gradient vectors based on the expected state visitation frequency, and repeating steps S24-S26 until convergence, then entering step S27;
convergence is reached when the feature weight coefficients no longer change;
S27, taking the current reward function as the optimal reward function, to obtain the traffic features.
In the above process, when the agent learns the reward function in the environment model, the most intuitive way to mine the potential reward function information is to learn directly from the expert strategy set, provided that the expert strategies can visit all states in the environment. Among these states, the states through which the expert strategies frequently pass are voted as preferred states; the frequently passed states are weighted so that the agent mainly considers the high-frequency states while learning the return function. In addition, a given expert strategy may be poor, so the expert strategies need to be analyzed and the information provided by all trajectories synthesized to evaluate each expert strategy and form a brand-new expert strategy set. This embodiment therefore introduces two weights, z_n for the states in the environment and l for the expert strategies.
From the perspective of a probability model, inverse reinforcement learning assumes that a hidden probability distribution generates the known expert strategies, and solves for the trajectory distribution probability model. Some states may be passed by most trajectories, yet their importance in the actual application environment is not the same; for example, vehicles tend to drive on roads with wider lanes and to avoid downtown areas that are easily congested. The weight of each state in the environment is therefore different, and the feature expectation weight z_n is defined for the state s_t of the expert strategy ξ_i accordingly.
The lengths of the expert strategies in the input set may differ, so the invention introduces a length weight factor so that each expert strategy carries the same weight within the expert strategy set. Suppose the expert strategy set has m expert strategies:
T = {ξ_0, ξ_1, ..., ξ_m}
where the length of an expert strategy is q. Given the m expert strategies, in step S22 the feature expectation of the expert strategy ξ_i is the length-weighted, discounted sum of the rewards of its states:
μ(ξ_i) = l · Σ_t γ^t · R(s_t^i)
where l is the length weight of the expert strategy ξ_i, γ^t is the learning rate of the algorithm, R(s_t^i) is the reward function, i.e. the reward given to the state of the i-th trajectory at time t, and s_t^i is the state of the i-th trajectory at time t;
the length weight l of the expert strategy ξ_i is determined by len(ξ_i), where len(·) is the length of the expert strategy;
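The following sketch illustrates a length-weighted, discounted feature expectation of this kind. The reward values, the discount γ and, in particular, the concrete form of the length weight (taken here as the average strategy length divided by len(ξ_i)) are assumptions made only for illustration, not the expressions claimed by the invention:

```python
import numpy as np

def length_weight(trajectory, trajectories):
    # Assumed form of the length weight l: average strategy length divided by
    # len(trajectory), so short and long strategies contribute comparably.
    avg_len = np.mean([len(t) for t in trajectories])
    return avg_len / len(trajectory)

def feature_expectation(trajectory, reward, trajectories, gamma=0.9):
    # Length-weighted, discounted sum of the per-state rewards R(s_t) along
    # the expert strategy, as described for step S22.
    l = length_weight(trajectory, trajectories)
    return l * sum((gamma ** t) * reward[s] for t, (s, a) in enumerate(trajectory))

# Toy expert strategy set over 4 states (state indices and rewards invented).
reward = np.array([0.1, 0.5, 0.2, 0.8])
T = [[(0, 1), (1, 0), (3, 2)],
     [(2, 1), (3, 0)],
     [(0, 0), (2, 2), (1, 1), (3, 0)]]

print([round(feature_expectation(tr, reward, T), 3) for tr in T])
```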
The state action value function p(s' | s, a) in step S24 gives the probability that agent i, taking behavior a in state s, reaches state s'. In step S26, the feature weight coefficients are updated with the gradient vector of the maximum likelihood function L(θ): the gradient is the difference between the visitation frequency actually computed from the expert trajectories and the expected frequency of the pair (s_t, a_t), multiplied by the corresponding feature vector.
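For orientation, the following is a compact sketch of a generic maximum-entropy IRL update loop (soft value iteration for the backward pass, forward propagation of the state visitation frequency, then a gradient step on the feature weights). It uses identity state features and a fixed horizon and omits the multi-weight refinements (z_n and l) of the present method, so it should be read as a simplified reference rather than the claimed algorithm:

```python
import numpy as np

def maxent_irl(P, feats, expert_states, horizon=20, lr=0.1, iters=50, gamma=0.9):
    """Fit feature weights theta so the expected state visitation frequency
    matches the expert one (a generic analogue of steps S24-S26)."""
    m, n, _ = P.shape
    theta = np.zeros(feats.shape[1])

    # Empirical state visitation frequency from the expert trajectories.
    mu_expert = np.zeros(m)
    p0 = np.zeros(m)
    for traj in expert_states:
        p0[traj[0]] += 1.0
        for s in traj:
            mu_expert[s] += 1.0
    mu_expert /= mu_expert.sum()
    p0 /= p0.sum()

    for _ in range(iters):
        r = feats @ theta                  # state reward from current theta

        # Backward pass: soft value iteration giving a stochastic policy.
        V = np.zeros(m)
        for _ in range(horizon):
            Q = r[:, None] + gamma * np.einsum("san,n->sa", P, V)
            V = np.log(np.exp(Q).sum(axis=1))
        policy = np.exp(Q - V[:, None])    # pi(a|s), rows sum to 1

        # Forward pass: expected state visitation frequency under the policy.
        D, d_t = np.zeros(m), p0.copy()
        for _ in range(horizon):
            D += d_t
            d_t = np.einsum("s,sa,san->n", d_t, policy, P)
        D /= D.sum()

        # Gradient of the log-likelihood: expert minus expected feature counts.
        theta += lr * feats.T @ (mu_expert - D)
    return theta

# Toy problem: 3 states, 2 behaviors, identity state features (values invented).
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 1, 2] = 1.0
P[1, 0, 2] = P[1, 1, 0] = 1.0
P[2, 0, 2] = P[2, 1, 1] = 1.0
feats = np.eye(3)
expert = [[0, 2, 2, 2], [1, 2, 2]]
print(np.round(maxent_irl(P, feats, expert), 3))
```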
Step S3 of this embodiment specifically includes:
S31, constructing an environment model M based on the urban road network under the current traffic control scheme and the corresponding expert strategy set T = {ξ_0, ξ_1, ..., ξ_m};
S32, placing a plurality of agents in the constructed environment model, and providing each agent with three experience cache pools;
wherein the three experience cache pools of each agent are the priority cache pool R_priority, the profit cache pool R_success and the loss cache pool R_failure;
S33, each agent in turn selects a behavior a_t, which modifies the current agent's state s_t and yields a corresponding return value r_t, thereby determining the corresponding experience (s_t, a_t, r_t, s_t+1);
S34, storing the experience (s_t, a_t, r_t, s_t+1) together with its priority Y into the corresponding experience cache pool of the agent, according to the return value r_t;
S35, extracting experiences, in a set proportion, from the agent's own experience cache pools and from those of the agent whose state is closest to its state s_t, to form a training set, and training the agent's DDPG network structure;
S36, clustering the trained agents by their trajectory similarity D_s to group the agents;
S37, performing an intra-group evolution operation on each group of agents obtained by the grouping, to obtain several new agent subgroups and form the trajectory set of the current urban road network;
S38, counting the traffic of each road section in the urban road network based on the trajectory set of the current urban road network, to obtain the traffic distribution prediction.
In step S32, the invention introduces an experience cache pool mechanism: while retaining the original priority cache pool R_priority, two new experience cache pools are created, the profit cache pool R_success and the loss cache pool R_failure. The profit cache pool stores experiences that obtain gains, and the loss cache pool stores experiences that receive penalties. While the priority of an experience is computed and the experience is put into the priority cache pool, it is also judged whether it is a gain experience or a penalty experience and stored into the corresponding experience cache pool.
The step S34 is specifically:
R1, storing the experience (s_t, a_t, r_t, s_t+1) into the priority cache pool R_priority, computing the priority Y of the experience (s_t, a_t, r_t, s_t+1), and storing the priority Y into the priority cache pool R_priority;
wherein the priority Y of the experience (s_t, a_t, r_t, s_t+1) is calculated as:
Y = A·z_t + B·δ_t
where A and B are error coefficients, z_t is the eligibility factor and δ_t is the TD error;
the eligibility factor z_t is determined from the discount coefficient γ, the trace-decay parameter λ and the return value obtained by taking behavior a_t in state s_t;
the TD error δ_t is:
δ_t = R_t + γ·Q'(s_t, argmax_a Q(s_t, a_t)) − Q(s_t-1, a_t-1)
where R_t is the reward obtained at time t and Q'(·) is the behavior selection function that maximizes the reward for the corresponding state;
R2, when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is positive, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the profit cache pool R_success;
when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is negative, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the loss cache pool R_failure.
In step S35, when an agent learns, a certain amount of experience is extracted from the three experience pools in a specific ratio; this removes the correlation between data, improves the stability of neural network training and reduces training oscillation. The extraction ratios from the priority cache pool, the profit cache pool and the loss cache pool are as follows:
where episode_n is the current round number and episode_max is the total number of rounds.
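Since the exact expression for the extraction ratio is not reproduced above, the sketch below simply assumes a linearly decaying α over the rounds and an even split of the remainder between the profit and loss pools; the pool contents are toy data:

```python
import random

def sample_minibatch(priority_pool, success_pool, failure_pool,
                     episode_n, episode_max, batch_size=32):
    """Mixed extraction in the spirit of step S35; the decay schedule of alpha
    and the even split of the remainder are assumptions for illustration."""
    alpha = max(0.2, 1.0 - episode_n / episode_max)   # assumed decay schedule
    k_priority = int(alpha * batch_size)
    k_rest = batch_size - k_priority
    batch = random.sample(priority_pool, min(k_priority, len(priority_pool)))
    batch += random.sample(success_pool, min(k_rest // 2, len(success_pool)))
    batch += random.sample(failure_pool, min(k_rest - k_rest // 2, len(failure_pool)))
    return batch

# Toy pools of (s_t, a_t, r_t, s_t+1) tuples, purely for illustration.
prio = [(i, 0, 0.0, i + 1) for i in range(100)]
gain = [(i, 1, 1.0, i + 1) for i in range(50)]
loss = [(i, 1, -1.0, i + 1) for i in range(50)]
print(len(sample_minibatch(prio, gain, loss, episode_n=10, episode_max=200)))
```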
α decreases as the number of rounds increases, so that the agent pays more attention to the priority cache pool in the early stage of training, while the experience in the profit cache reduces unnecessary blind exploration. When extracting experience from the cache pools of other agents, the agent selects, according to its own state, the agent whose state is closest to its own, so that the experience of other agents exploring near the state s is fully utilized and repeated exploration of the same environment is effectively avoided. Therefore, step S36 is specifically:
B1, calculating the trajectory similarity D_s between the trajectories generated by the agents through a dynamic time warping algorithm;
B2, taking the similarity D_s of any two trajectories as the inter-class distance;
specifically, when the similarity of two trajectories is calculated, every point in the trajectories is compared until the end point is reached, and all calculated distances are accumulated to obtain the trajectory similarity used in the invention; using the idea of dynamic programming, the warping with the minimum accumulated distance is found, and this accumulated distance is the path similarity between the two trajectories;
B3, dividing the agents whose inter-class distance is smaller than a set threshold into one group, thereby grouping the agents.
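A small sketch of steps B1-B3: a classic dynamic-time-warping accumulated distance as the trajectory similarity D_s, followed by threshold-based grouping. The greedy grouping rule and the threshold value are assumptions for illustration:

```python
import numpy as np

def dtw_distance(traj_a, traj_b):
    """Classic DTW accumulated distance between two trajectories of 2-D points,
    used here as the similarity D_s of step B1."""
    na, nb = len(traj_a), len(traj_b)
    D = np.full((na + 1, nb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            cost = np.linalg.norm(np.subtract(traj_a[i - 1], traj_b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[na, nb]

def group_agents(trajectories, threshold):
    # Agents whose pairwise DTW distance stays below the threshold share a group.
    groups = []
    for idx, traj in enumerate(trajectories):
        for group in groups:
            if all(dtw_distance(traj, trajectories[j]) < threshold for j in group):
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Toy agent trajectories (coordinates invented); the threshold is an assumption.
trajs = [[(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 0.1), (2, 0.1)], [(5, 5), (6, 6)]]
print(group_agents(trajs, threshold=1.0))   # -> [[0, 1], [2]]
```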
In step S37, the method of performing the intra-group evolution operation within each group of agents to obtain a new agent subgroup specifically includes:
C1, regarding all agents as a population pop, and encoding each agent;
C2, calculating the fitness of the agents in the population pop based on their encodings;
C3, initializing an empty population newpop, and adding the agent with the maximum fitness into newpop;
C4, selecting among the remaining agents in the population pop, adding the selected agents into the population newpop, and updating newpop;
C5, replacing the population pop with the updated population newpop, and decoding each agent in newpop to obtain the new agent subgroup.
In this process, a number of agents are put into the environment for training at the same time. At the end of each round, the DDPG models trained in the environment vary in quality; the elite-retention strategy keeps the models that perform well, the genetic evolution strategy applies crossover and mutation to the poorer models, and the result is put back into the environment as a new population for the next round of training.
First, the neural network weights of each agent are encoded as real numbers. In step C1, the method for encoding each agent in the population pop includes:
performing real-number encoding on the weights of the agent's neural network, in the form:
Agent = [w_11 ... w_x'y' ... w_xy, w_11 ... w_y'q' ... w_yq]
where w_x'y' is the weight from input-layer neuron x' to hidden-layer neuron y' in the agent's neural network, and w_y'q' is the weight from hidden-layer neuron y' to output-layer neuron q';
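A minimal sketch of this real-number encoding and the corresponding decoding, for a hypothetical network with x = 3 input, y = 4 hidden and q = 2 output neurons:

```python
import numpy as np

def encode_agent(w_in_hidden, w_hidden_out):
    """Real-number encoding of step C1: flatten the input->hidden and
    hidden->output weight matrices into one chromosome vector."""
    return np.concatenate([w_in_hidden.ravel(), w_hidden_out.ravel()])

def decode_agent(chromosome, shape_in_hidden, shape_hidden_out):
    split = shape_in_hidden[0] * shape_in_hidden[1]
    return (chromosome[:split].reshape(shape_in_hidden),
            chromosome[split:].reshape(shape_hidden_out))

# Hypothetical small network: 3 input, 4 hidden and 2 output neurons.
rng = np.random.default_rng(0)
w_xy = rng.normal(size=(3, 4))     # weights w_x'y'
w_yq = rng.normal(size=(4, 2))     # weights w_y'q'
agent = encode_agent(w_xy, w_yq)
print(agent.shape)                 # -> (20,)
w_xy2, w_yq2 = decode_agent(agent, (3, 4), (4, 2))
assert np.allclose(w_xy, w_xy2) and np.allclose(w_yq, w_yq2)
```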
In step C2, the genetic algorithm takes over the optimization of the agent's DDPG neural network and evaluates the quality of each individual by the fitness function, so that inferior individuals can be eliminated; the fitness f(b) of agent b is:
f(b) = R_b + q·Σδ_b
where R_b is the total reward of agent b over each round, Σδ_b is the total loss value of agent b, and q is the learning rate;
When the agents are selected, in order to prevent the optimal solution generated during evolution from being destroyed by crossover and mutation, the invention adopts an elite-retention strategy that copies the individual with the maximum fitness in each group to the next generation without any change. Step C4 is therefore specifically:
C41, selecting agents by roulette-wheel selection among the remaining agents in the population pop, with the selection probability of each agent proportional to its fitness;
C42, generating new agents from the selected agents according to the set crossover probability and mutation probability;
since the settings of the crossover probability P_c and the mutation probability P_m strongly influence the convergence of the genetic algorithm and can increase the error between the optimal solution and the true solution, the invention uses adaptively adjusted crossover and mutation probabilities to ensure the diversity of the population:
the crossover probability P_c is:
P_c = 1 − 0.5 / (1 + e^Δf)
and the mutation probability P_m is:
P_m = 1 − 0.05 / (1 + e^Δf)
where Δf is the difference between the maximum fitness value and the average fitness value in the population pop;
C43, adding the generated new agents into the population newpop until the number of agents in newpop reaches the set number, completing the update of newpop.
Since this embodiment uses real-number encoding, the crossover and mutation operations are the arithmetic crossover operation and the uniform mutation operation, respectively. Arithmetic crossover means that two agents generate two new agents by linear combination; assuming the agents Agent_a and Agent_b undergo arithmetic crossover, the two new individuals generated by the crossover operation are:
Agent'_a = P_c·Agent_a + (1 − P_c)·Agent_b
Agent'_b = P_c·Agent_b + (1 − P_c)·Agent_a
This embodiment adopts uniform mutation: for an agent encoded as above, if w_x'y' is a mutation point with value range [ω_min, ω_max], a random number is generated for the mutation point w_x'y' and compared with the mutation probability P_m; if it is smaller, a random value w'_x'y' is selected in [ω_min, ω_max] to replace w_x'y'.
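The genetic operators of steps C2-C43 can be sketched as follows; the fitness inputs, the weight range [ω_min, ω_max], the learning rate q and the positive shift used in the roulette wheel are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(total_reward, total_loss, q=0.01):
    # f(b) = R_b + q * sum(delta_b), as described for step C2.
    return total_reward + q * total_loss

def roulette_select(population, fits):
    # Roulette-wheel selection; fitness values are shifted to stay positive.
    p = np.asarray(fits, dtype=float) - min(fits) + 1e-6
    return population[rng.choice(len(population), p=p / p.sum())]

def adaptive_probs(fits):
    # Adaptive crossover/mutation probabilities driven by the spread delta_f.
    delta_f = max(fits) - float(np.mean(fits))
    return 1 - 0.5 / (1 + np.exp(delta_f)), 1 - 0.05 / (1 + np.exp(delta_f))

def arithmetic_crossover(agent_a, agent_b, p_c):
    child_a = p_c * agent_a + (1 - p_c) * agent_b
    child_b = p_c * agent_b + (1 - p_c) * agent_a
    return child_a, child_b

def uniform_mutation(agent, p_m, w_min=-1.0, w_max=1.0):
    # Genes whose random draw falls below p_m are replaced by a random value
    # from [w_min, w_max], following the uniform mutation described above.
    mutated = agent.copy()
    mask = rng.random(agent.shape) < p_m
    mutated[mask] = rng.uniform(w_min, w_max, size=mask.sum())
    return mutated

# Toy population of encoded agents (chromosomes invented for illustration).
pop = [rng.normal(size=6) for _ in range(4)]
fits = [fitness(r, l) for r, l in [(10, -2), (7, -1), (12, -4), (5, -0.5)]]
p_c, p_m = adaptive_probs(fits)
a, b = roulette_select(pop, fits), roulette_select(pop, fits)
child_a, child_b = arithmetic_crossover(a, b, p_c)
print(np.round(uniform_mutation(child_a, p_m), 3))
```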
Example 2:
This embodiment provides a traffic simulation platform under variable traffic control schemes, built on the above traffic distribution prediction method.
The platform can predict regional traffic distribution under different traffic control schemes on the basis of the real urban road network and checkpoint (bayonet) data. The framework of the traffic simulation platform under a variable traffic control scheme is shown in figure 2; the platform has three sub-modules: a road network module, a traffic module and an algorithm module. The road network module models the urban road network within a specified range of a specified city; the traffic module mines driving trajectory data from the checkpoint data of a certain period within the specified range of the city; the algorithm module is divided into a traffic feature extraction module and a traffic simulation module, where the feature extraction module uses the improved maximum-entropy inverse reinforcement learning algorithm, on the basis of the road network module and the traffic module, to extract the traffic features of the area at a specific time, and the traffic simulation module takes the computed traffic features and the urban road network under the traffic control scheme as input and performs traffic simulation through forward reinforcement learning; finally the traffic is displayed in a visual form.
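A hedged sketch of how the three sub-modules could be composed in code; all class and function names are invented for illustration and do not describe the platform's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RoadNetworkModule:
    region: str
    def build(self) -> Dict[str, List[str]]:
        # Would model the urban road network inside the specified range.
        return {"v1": ["v2"], "v2": ["v3"], "v3": []}

@dataclass
class FlowModule:
    checkpoint_records: List[dict]
    def mine_trajectories(self) -> List[List[str]]:
        # Would mine driving trajectories from the checkpoint (bayonet) data.
        return [["v1", "v2", "v3"]]

@dataclass
class AlgorithmModule:
    extract_features: Callable   # improved maximum-entropy IRL
    simulate_flow: Callable      # forward reinforcement learning

def run_platform(network: RoadNetworkModule, flow: FlowModule,
                 algo: AlgorithmModule, control_scheme: dict) -> dict:
    graph = network.build()
    trajectories = flow.mine_trajectories()
    features = algo.extract_features(graph, trajectories)
    return algo.simulate_flow(graph, features, control_scheme)

result = run_platform(
    RoadNetworkModule("demo region"), FlowModule([]),
    AlgorithmModule(lambda g, t: {"feature": 1.0},
                    lambda g, f, c: {"v1->v2": 120}),
    control_scheme={"banned_roads": []})
print(result)   # -> {'v1->v2': 120}
```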
This embodiment simulates the traffic control scheme of the Mianyang Science and Technology Expo in September 2019 to predict the regional traffic under a traffic control scheme based on road section prohibition, and verifies the accuracy of regional traffic prediction for a traffic control scheme based on prohibited turns at intersections by simulating the turn-prohibition traffic control applied in the traffic operation of the CBD area of Mianyang city in December 2019.
(1) Road network selection explanation:
For the regional traffic prediction of the road-section-prohibition control scheme, the main urban area of Mianyang city was selected after consulting the Mianyang traffic police detachment and the smart city project staff of Mianyang city. The selected range is the rectangle enclosed by the four points of the Dragon Stream logistics park (104.717637, 31.513032), the Mianyang suburban airport (104.749689, 31.435669), the miller hub (104.603876, 31.450148) and the Youth Square (104.774194, 31.46093). As shown in fig. 3 (a, road network range; b, Mianyang road network), the regional road network was crawled and the Mianyang road network simplified; it contains 3200 road sections and 1016 intersections.
(2) Preparing data:
The experiment predicts the traffic distribution of the morning peak from 8:00 to 9:00 on 6 September 2019 during the Expo. From the checkpoint data of 1 September 2019 to 5 September 2019, 302 expert strategy sets for the morning peak were compiled; the smallest expert strategy set contains 43 trajectories, the largest contains 809, and the average trajectory length is 46.
(3) Traffic control scheme description:
During the Expo, the traffic police department of Mianyang city imposed road prohibitions on the following roads: (a) one-way traffic from the Flying Cloud Avenue winery intersection to the mountain road rubdown intersection; (b) one-way traffic from the Liaoning Avenue Li Hospital intersection to the old Yong'an Road Li Hospital intersection; (c) one-way traffic from the old Yong'an Road Xin automobile intersection to the Bubo Garden intersection, as shown in fig. 4.
(4) Simulation experiment process:
To verify the accuracy and reliability of the proposed algorithm, several traffic simulations were carried out for the Mianyang urban area during the Expo after the traffic control scheme was applied; the experimental results are shown in fig. 5. The prediction without road closure at 9:00 a.m. on the first day of the Expo shows that the core circle and the control circle are congested at 9:00 during the morning peak, and the thermodynamic diagram shows large red and orange congested areas around the Expo exhibition center; after the one-way road closure is applied, only a small number of road sections in the core and control areas show orange congestion, and most areas show light green, i.e. basically smooth traffic. In the diversion area, different road sections show different traffic states without road closure, and after the closure the road sections close to the control area show orange congestion.
As can be seen from the comparison of fig. 5(a) and fig. 5(b), the predicted flow distribution is similar to the actual flow distribution. Fig. 5(c) compares the predicted data of all one-way road sections in the road network with the actual data; the predicted and actual data show the same distribution trend across the road network. The 20 road sections with the largest numbers of passing vehicles are then taken out and observed in fig. 5(d); the data predicted by the algorithm are slightly larger than the actual data, with f(s) denoting the predicted flow and the corresponding measured value the actual flow. In summary, the accuracy of the regional traffic prediction experiment for the Mianyang urban area, after the platform applies the traffic control scheme used during the Mianyang Science and Technology Expo, is as follows: