Disclosure of Invention
In order to overcome the defects of the prior art, the method for predicting regional traffic distribution under a variable traffic control scheme based on the WMGIRL algorithm starts from the perspective of driving trajectories and comprehensively considers multiple aspects, such as the environment model and road attributes, so as to improve the accuracy of traffic distribution prediction.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme: the method for predicting regional traffic distribution under a variable traffic control scheme based on the WMGIRL algorithm comprises the following steps:
S1, modeling the urban road network of the area to be predicted, and collecting driving trajectory data within the area;
S2, extracting traffic features of the area to be predicted through a multi-weight maximum-entropy inverse reinforcement learning method, based on the urban road network and the driving trajectory data corresponding to the area;
S3, processing the urban road network with a forward reinforcement learning method, based on the extracted traffic features and the current traffic control scheme, to obtain the traffic distribution prediction result for the area to be predicted.
Further, in step S1, the method for modeling an urban road network for the area to be predicted specifically includes:
A1, in the area to be predicted, regarding each intersection as a vertex V and the road between two adjacent intersections as an edge E, to obtain the entity road network G(V, E);
A2, regarding the intersections in the entity road network G(V, E) as states of a Markov process and the roads as the behaviors required by the agent for state transitions, so that the entity road network G(V, E) is treated as the environment model;
A3, determining the environment matrix F of the environment model from the connectivity of the entity road network G(V, E), and determining the state transition matrix P from the attributes of the roads in G(V, E), thereby completing the urban road network modeling;
wherein, the environment matrix F is:
F = [(e_1, l_1), (e_2, l_2), ..., (e_f, l_f)]
where (e_f', l_f') ∈ F, e_f' is a road network node, l_f' is the number of motor vehicle lanes at node e_f', and the subscript f' is the node ordinal, f' = 1, 2, 3, ..., f.
The state transition matrix P is a three-dimensional matrix of size m × n × m built on the environment matrix, where m is the number of states and n is the number of behaviors; each element P_sas' of the state transition matrix P indicates the probability that taking behavior a in state s leads to the new state s', with P_sas' ∈ [0, 1].
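Purely as an illustration of these two data structures (the node names, lane counts and transition probabilities below are invented and not part of the claimed method), the environment matrix F and the m × n × m state transition matrix P could be laid out as follows:

```python
import numpy as np

# Toy example: 4 intersections (states) and up to 3 outgoing roads (behaviors)
# per intersection; all node names, lane counts and probabilities are invented.
m, n = 4, 3

# Environment matrix F: one (node, number-of-motor-vehicle-lanes) pair per node.
F = [("e1", 2), ("e2", 3), ("e3", 2), ("e4", 4)]

# State transition matrix P of size m x n x m; P[s, a, s_next] is the
# probability that taking behavior a in state s leads to state s_next.
P = np.zeros((m, n, m))
P[0, 0, 1] = 1.0      # from intersection e1, road 0 leads to e2
P[0, 1, 2] = 1.0      # from intersection e1, road 1 leads to e3
P[1, 0, 3] = 0.7      # an uncertain transition split between two successors
P[1, 0, 0] = 0.3

# Every (s, a) slice must stay a (sub-)probability distribution over s_next.
assert np.all(P.sum(axis=2) <= 1.0 + 1e-9)
print(P[1, 0])        # -> [0.3 0.  0.  0.7]
```

Indexing P[s, a, s'] directly mirrors the definition of P_sas' above.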
Further, the step S2 is specifically:
S21, constructing an expert strategy set T = {ξ_0, ξ_1, ..., ξ_m} based on the urban road network and the collected driving trajectory data;
where ξ_i ∈ T, ξ_i is an expert strategy, i = 1, 2, 3, ..., m; an expert strategy ξ_i is ξ_i = {(s_0, a_0), (s_1, a_1), ..., (s_n, a_n)}, with (s_t, a_t) ∈ ξ_i, (s_t, a_t) a state-behavior pair, s_t a state, a_t a behavior, and the subscript t the t-th time step, t = 1, 2, 3, ..., n.
S22, setting a feature expectation weight for the state s_t of each expert strategy ξ_i in the expert strategy set T, and calculating the feature expectation of each expert strategy;
S23, setting a feature weight coefficient x to account for the differing lengths of the expert strategies in the set, and initializing the feature weight coefficient x;
S24, determining the state return rate c from the feature weight coefficient x, and further obtaining the reward function and the state action value function p(s' | s, a);
S25, calculating the expected state visitation frequency through a forward-backward algorithm;
S26, updating the feature weight coefficients with gradient vectors based on the expected state visitation frequency, and repeating steps S24-S26 until convergence, then entering step S27;
S27, taking the current reward function as the optimal reward function, to obtain the traffic features.
Further, in step S22, the feature expectation weight of the state s_t of the expert strategy ξ_i is z_n;
in step S22, the feature expectation of the expert strategy ξ_i is the length-weighted, discounted sum of the rewards of its states:
μ(ξ_i) = l · Σ_t γ^t · R(s_t^i)
where l is the length weight of the expert strategy ξ_i, γ^t is the learning rate of the algorithm, R(s_t^i) is the reward function, i.e. the reward given to the state of the i-th trajectory at time t, and s_t^i is the state of the i-th trajectory at time t;
the length weight l of the expert strategy ξ_i is determined by len(ξ_i), where len(·) is the length of the expert strategy;
the state action value function p(s' | s, a) in step S24 gives the probability that agent i, taking behavior a in state s, reaches state s'.
In step S26, the feature weight coefficients are updated with the gradient vector of the maximum likelihood function L(θ): the gradient is the difference between the visitation frequency actually computed from the expert trajectories and the expected frequency of the pair (s_t, a_t), multiplied by the corresponding feature vector.
Further, the step S3 is specifically:
S31, constructing an environment model M based on the urban road network under the current traffic control scheme and the corresponding expert strategy set T = {ξ_0, ξ_1, ..., ξ_m};
S32, placing a plurality of agents in the constructed environment model, and providing each agent with three experience cache pools;
wherein the three experience cache pools of each agent are the priority cache pool R_priority, the profit cache pool R_success and the loss cache pool R_failure;
S33, each agent in turn selects a behavior a_t, which modifies the current agent's state s_t and yields a corresponding return value r_t, thereby determining the corresponding experience (s_t, a_t, r_t, s_t+1);
S34, storing the experience (s_t, a_t, r_t, s_t+1) together with its priority Y into the corresponding experience cache pool of the agent, according to the return value r_t;
S35, extracting experiences, in a set proportion, from the agent's own experience cache pools and from those of the agent whose state is closest to its state s_t, to form a training set, and training the agent's DDPG network structure;
S36, clustering the trained agents by their trajectory similarity D_s to group the agents;
S37, performing an intra-group evolution operation on each group of agents obtained by the grouping, to obtain several new agent subgroups and form the trajectory set of the current urban road network;
S38, counting the traffic of each road section in the urban road network based on the trajectory set of the current urban road network, to obtain the traffic distribution prediction.
Further, the step S34 is specifically:
R1, storing the experience (s_t, a_t, r_t, s_t+1) into the priority cache pool R_priority, computing the priority Y of the experience (s_t, a_t, r_t, s_t+1), and storing the priority Y into the priority cache pool R_priority;
R2, when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is positive, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the profit cache pool R_success;
when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is negative, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the loss cache pool R_failure.
Further, in step R1, the priority Y of the experience (s_t, a_t, r_t, s_t+1) is calculated as:
Y = A·z_t + B·δ_t
where A and B are error coefficients, z_t is the eligibility factor and δ_t is the TD error;
the eligibility factor z_t is determined from the discount coefficient γ, the trace-decay parameter λ and the return value obtained by taking behavior a_t in state s_t;
the TD error δ_t is:
δ_t = R_t + γ·Q'(s_t, argmax_a Q(s_t, a_t)) − Q(s_t-1, a_t-1)
where R_t is the reward obtained at time t and Q'(·) is the behavior selection function that maximizes the reward for the corresponding state.
Further, the step S36 is specifically:
B1, calculating the trajectory similarity D_s between the trajectories generated by the agents through a dynamic time warping algorithm;
B2, taking the similarity D_s of any two trajectories as the inter-class distance;
B3, dividing the agents whose inter-class distance is smaller than a set threshold into one group, thereby grouping the agents.
Further, in step S37, performing the intra-group evolution operation on each group of agents to obtain a new agent subgroup specifically includes:
C1, regarding all agents as a population pop, and encoding each agent;
C2, calculating the fitness of the agents in the population pop based on their encodings;
C3, initializing an empty population newpop, and adding the agent with the maximum fitness into newpop;
C4, selecting among the remaining agents in the population pop, adding the selected agents into the population newpop, and updating newpop;
C5, replacing the population pop with the updated population newpop, and decoding each agent in newpop to obtain the new agent subgroup.
Further, in step C1, the method for encoding each agent in the population pop includes:
performing real-number encoding on the weights of the agent's neural network, in the form:
Agent = [w_11 ... w_x'y' ... w_xy, w_11 ... w_y'q' ... w_yq]
where w_x'y' is the weight from input-layer neuron x' to hidden-layer neuron y' in the agent's neural network, and w_y'q' is the weight from hidden-layer neuron y' to output-layer neuron q';
in step C2, the fitness f(b) of agent b is:
f(b) = R_b + q·Σδ_b
where R_b is the total reward of agent b over each round, Σδ_b is the total loss value of agent b, and q is the learning rate;
the step C4 specifically includes:
C41, selecting agents by roulette-wheel selection among the remaining agents in the population pop;
C42, generating new agents from the selected agents according to the set crossover probability and mutation probability;
wherein the crossover probability P_c is:
P_c = 1 − 0.5 / (1 + e^Δf)
and the mutation probability P_m is:
P_m = 1 − 0.05 / (1 + e^Δf)
where Δf is the difference between the maximum fitness value and the average fitness value in the population pop;
C43, adding the generated new agents into the population newpop until the number of agents in newpop reaches the set number, completing the update of newpop.
The invention has the following beneficial effects:
(1) the method starts from the perspective of driving trajectories and comprehensively considers multiple aspects, such as the environment model and road attributes, to improve the accuracy of traffic distribution prediction;
(2) the method first models the road network and extracts the connection relations among road points as the basis for simplifying the road network; the subsequent simplification reduces the workload of the maximum entropy algorithm and avoids an oversized environment model, without destroying the link relations among the intersections in the road network;
(3) the method exploits the characteristic of inverse reinforcement learning that the return function of the environment is mined with the expert strategy set as the research object, and thereby mines the distinctive driving trajectory characteristics of urban residents;
(4) on the basis of the DDPG algorithm, the invention proposes a multi-experience-discrimination multi-agent reinforcement learning algorithm with intra-group evolution; through mechanisms such as information interaction and intra-group inheritance, multiple agents can quickly find the target in an unknown environment, and compared with traditional learning algorithms the optimal strategy in the environment can be found quickly and effectively.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those of ordinary skill in the art, various changes that fall within the spirit and scope of the invention as defined by the appended claims remain protected, and everything produced by making use of the inventive concept falls within the scope of protection.
Example 1:
As shown in fig. 1, the method for predicting regional traffic distribution under a variable traffic control scheme based on the WMGIRL algorithm includes the following steps:
S1, modeling the urban road network of the area to be predicted, and collecting driving trajectory data within the area;
S2, extracting traffic features of the area to be predicted through a multi-weight maximum-entropy inverse reinforcement learning method, based on the urban road network and the driving trajectory data corresponding to the area;
S3, processing the urban road network with a forward reinforcement learning method, based on the extracted traffic features and the current traffic control scheme, to obtain the traffic distribution prediction result for the area to be predicted.
In step S1, when modeling the urban road network, the things in the real world and their relations are reasonably abstracted, according to corresponding criteria, into a form composed of nodes and edges, and the corresponding entity network can be expressed as G(V, E). Based on this, the method for modeling the urban road network of the area to be predicted in step S1 specifically includes:
A1, in the area to be predicted, regarding each intersection as a vertex V and the road between two adjacent intersections as an edge E, to obtain the entity road network G(V, E);
where V denotes a finite set of nodes, V = {v_1, v_2, ..., v_m}, and E denotes a finite set of edges, E = {e_1, e_2, ..., e_n}, in which e = (v_i, v_j) denotes that an edge e is a directed ray determined by the two nodes v_i, v_j; for an unweighted network of n nodes, the graph is described by the adjacency matrix A(G) = [a_ij]_{n×n}, where a_ij = 1 if there is an edge from node v_i to node v_j and a_ij = 0 otherwise.
A2, regarding the intersections in the entity road network G(V, E) as states of a Markov process and the roads as the behaviors required by the agent for state transitions, so that the entity road network G(V, E) is treated as the environment model;
A3, determining the environment matrix F of the environment model from the connectivity of the entity road network G(V, E), and determining the state transition matrix P from the attributes of the roads in G(V, E), thereby completing the urban road network modeling;
wherein, the environment matrix F is:
F = [(e_1, l_1), (e_2, l_2), ..., (e_f, l_f)]
where (e_f', l_f') ∈ F, e_f' is a road network node, l_f' is the number of motor vehicle lanes at node e_f', and the subscript f' is the node ordinal, f' = 1, 2, 3, ..., f.
The state transition matrix P is a three-dimensional matrix of size m × n × m built on the environment matrix, where m is the number of states and n is the number of behaviors; each element P_sas' of the state transition matrix P indicates the probability that taking behavior a in state s leads to the new state s', with P_sas' ∈ [0, 1].
In the above urban road network modeling process, the degree k_i of a node e_i in such a network is usually defined as the number of nodes interconnected with it; a larger degree means that the node is more important. For a directed graph the degree of a node can generally be divided into two parts: the in-degree of node e_i is the number of nodes pointing to e_i, and the out-degree is the number of nodes that node e_i points to, so that k_i = k_i^in + k_i^out.
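As a small illustration of the degree relations just described (the adjacency values are arbitrary), the in-degree, out-degree and total degree can be read directly from an adjacency matrix A(G):

```python
import numpy as np

# Hypothetical directed, unweighted adjacency matrix A(G) for 4 nodes:
# A[i, j] = 1 if there is a road from node i to node j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])

out_degree = A.sum(axis=1)            # roads leaving each node
in_degree = A.sum(axis=0)             # roads arriving at each node
total_degree = in_degree + out_degree # k_i = k_i^in + k_i^out

print(in_degree, out_degree, total_degree)
```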
In general, in a complex network an edge can represent any association between nodes. For an urban traffic network, it is in practice not sufficient to express the association between adjacent intersections merely by whether a road section connects them and by its direction; at a finer granularity, lanes should also be divided into motor-vehicle lanes and non-motor-vehicle lanes, and the most important attributes of a lane include the actual number of lanes, the width, the road alignment, the gradient, etc.
Constrained by the research conditions, acquiring all parameters at the road network level is still difficult, so only the practical influence of the number of motor-vehicle lanes is considered first, while the other elements are gradually brought into the empirical analysis as the research advances; this is also common practice in road element analysis. Only the number-of-lanes attribute is therefore included in this study, and accordingly
E = {(e_1, l_1), (e_2, l_2), ..., (e_n, l_n)}
where l is the number of motor-vehicle lanes.
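A minimal sketch of how the entity road network G(V, E), its adjacency matrix and the lane-count attribute of the edge set E could be assembled from a road list; all road identifiers and lane numbers here are hypothetical:

```python
import numpy as np

# Hypothetical road list: (from intersection, to intersection, motor lanes).
road_list = [("v1", "v2", 2), ("v2", "v3", 3), ("v3", "v4", 2),
             ("v4", "v1", 1), ("v2", "v4", 4)]

nodes = sorted({v for a, b, _ in road_list for v in (a, b)})
index = {v: k for k, v in enumerate(nodes)}

# Adjacency matrix A(G) of the entity road network G(V, E).
A = np.zeros((len(nodes), len(nodes)), dtype=int)
for a, b, _ in road_list:
    A[index[a], index[b]] = 1

# Edge set E = {(e1, l1), ..., (en, ln)} keeping only the lane-count attribute.
E = {f"{a}->{b}": lanes for a, b, lanes in road_list}

print(A)
print(E["v2->v4"])   # -> 4
```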
The step S2 is specifically:
S21, constructing an expert strategy set T = {ξ_0, ξ_1, ..., ξ_m} based on the urban road network and the collected driving trajectory data;
where ξ_i ∈ T, ξ_i is an expert strategy, i = 1, 2, 3, ..., m; an expert strategy ξ_i is ξ_i = {(s_0, a_0), (s_1, a_1), ..., (s_n, a_n)}, with (s_t, a_t) ∈ ξ_i, (s_t, a_t) a state-behavior pair, s_t a state, a_t a behavior, and the subscript t the t-th time step, t = 1, 2, 3, ..., n;
S22, setting a feature expectation weight for the state s_t of each expert strategy ξ_i in the expert strategy set T, and calculating the feature expectation of each expert strategy;
S23, setting a feature weight coefficient x to account for the differing lengths of the expert strategies in the set, and initializing the feature weight coefficient x;
S24, determining the state return rate c from the feature weight coefficient x, and further obtaining the reward function and the state action value function p(s' | s, a);
S25, calculating the expected state visitation frequency through a forward-backward algorithm;
S26, updating the feature weight coefficients with gradient vectors based on the expected state visitation frequency, and repeating steps S24-S26 until convergence, then entering step S27;
convergence is reached when the feature weight coefficients no longer change;
S27, taking the current reward function as the optimal reward function, to obtain the traffic features.
In the above process, when the agent learns the reward function in the environment model, the most intuitive way to mine the potential reward function information is to learn directly from the expert strategy set, provided that the expert strategies can visit all states in the environment. Among these states, the states through which the expert strategies frequently pass are voted as preferred states; the frequently passed states are weighted so that the agent mainly considers the high-frequency states while learning the return function. In addition, a given expert strategy may be poor, so the expert strategies need to be analyzed and the information provided by all trajectories synthesized to evaluate each expert strategy and form a brand-new expert strategy set. This embodiment therefore introduces two weights, z_n for the states in the environment and l for the expert strategies.
From the perspective of a probability model, inverse reinforcement learning assumes that a hidden probability distribution generates the known expert strategies, and solves for the trajectory distribution probability model. Some states may be passed by most trajectories, yet their importance in the actual application environment is not the same; for example, vehicles tend to drive on roads with wider lanes and to avoid downtown areas that are easily congested. The weight of each state in the environment is therefore different, and the feature expectation weight z_n is defined for the state s_t of the expert strategy ξ_i accordingly.
The lengths of the expert strategies in the input set may differ, so the invention introduces a length weight factor so that each expert strategy carries the same weight within the expert strategy set. Suppose the expert strategy set has m expert strategies:
T = {ξ_0, ξ_1, ..., ξ_m}
where the length of an expert strategy is q. Given the m expert strategies, in step S22 the feature expectation of the expert strategy ξ_i is the length-weighted, discounted sum of the rewards of its states:
μ(ξ_i) = l · Σ_t γ^t · R(s_t^i)
where l is the length weight of the expert strategy ξ_i, γ^t is the learning rate of the algorithm, R(s_t^i) is the reward function, i.e. the reward given to the state of the i-th trajectory at time t, and s_t^i is the state of the i-th trajectory at time t;
the length weight l of the expert strategy ξ_i is determined by len(ξ_i), where len(·) is the length of the expert strategy;
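The following sketch illustrates a length-weighted, discounted feature expectation of this kind. The reward values, the discount γ and, in particular, the concrete form of the length weight (taken here as the average strategy length divided by len(ξ_i)) are assumptions made only for illustration, not the expressions claimed by the invention:

```python
import numpy as np

def length_weight(trajectory, trajectories):
    # Assumed form of the length weight l: average strategy length divided by
    # len(trajectory), so short and long strategies contribute comparably.
    avg_len = np.mean([len(t) for t in trajectories])
    return avg_len / len(trajectory)

def feature_expectation(trajectory, reward, trajectories, gamma=0.9):
    # Length-weighted, discounted sum of the per-state rewards R(s_t) along
    # the expert strategy, as described for step S22.
    l = length_weight(trajectory, trajectories)
    return l * sum((gamma ** t) * reward[s] for t, (s, a) in enumerate(trajectory))

# Toy expert strategy set over 4 states (state indices and rewards invented).
reward = np.array([0.1, 0.5, 0.2, 0.8])
T = [[(0, 1), (1, 0), (3, 2)],
     [(2, 1), (3, 0)],
     [(0, 0), (2, 2), (1, 1), (3, 0)]]

print([round(feature_expectation(tr, reward, T), 3) for tr in T])
```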
The state action value function p(s' | s, a) in step S24 gives the probability that agent i, taking behavior a in state s, reaches state s'. In step S26, the feature weight coefficients are updated with the gradient vector of the maximum likelihood function L(θ): the gradient is the difference between the visitation frequency actually computed from the expert trajectories and the expected frequency of the pair (s_t, a_t), multiplied by the corresponding feature vector.
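For orientation, the following is a compact sketch of a generic maximum-entropy IRL update loop (soft value iteration for the backward pass, forward propagation of the state visitation frequency, then a gradient step on the feature weights). It uses identity state features and a fixed horizon and omits the multi-weight refinements (z_n and l) of the present method, so it should be read as a simplified reference rather than the claimed algorithm:

```python
import numpy as np

def maxent_irl(P, feats, expert_states, horizon=20, lr=0.1, iters=50, gamma=0.9):
    """Fit feature weights theta so the expected state visitation frequency
    matches the expert one (a generic analogue of steps S24-S26)."""
    m, n, _ = P.shape
    theta = np.zeros(feats.shape[1])

    # Empirical state visitation frequency from the expert trajectories.
    mu_expert = np.zeros(m)
    p0 = np.zeros(m)
    for traj in expert_states:
        p0[traj[0]] += 1.0
        for s in traj:
            mu_expert[s] += 1.0
    mu_expert /= mu_expert.sum()
    p0 /= p0.sum()

    for _ in range(iters):
        r = feats @ theta                  # state reward from current theta

        # Backward pass: soft value iteration giving a stochastic policy.
        V = np.zeros(m)
        for _ in range(horizon):
            Q = r[:, None] + gamma * np.einsum("san,n->sa", P, V)
            V = np.log(np.exp(Q).sum(axis=1))
        policy = np.exp(Q - V[:, None])    # pi(a|s), rows sum to 1

        # Forward pass: expected state visitation frequency under the policy.
        D, d_t = np.zeros(m), p0.copy()
        for _ in range(horizon):
            D += d_t
            d_t = np.einsum("s,sa,san->n", d_t, policy, P)
        D /= D.sum()

        # Gradient of the log-likelihood: expert minus expected feature counts.
        theta += lr * feats.T @ (mu_expert - D)
    return theta

# Toy problem: 3 states, 2 behaviors, identity state features (values invented).
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 1, 2] = 1.0
P[1, 0, 2] = P[1, 1, 0] = 1.0
P[2, 0, 2] = P[2, 1, 1] = 1.0
feats = np.eye(3)
expert = [[0, 2, 2, 2], [1, 2, 2]]
print(np.round(maxent_irl(P, feats, expert), 3))
```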
Step S3 of this embodiment specifically includes:
S31, constructing an environment model M based on the urban road network under the current traffic control scheme and the corresponding expert strategy set T = {ξ_0, ξ_1, ..., ξ_m};
S32, placing a plurality of agents in the constructed environment model, and providing each agent with three experience cache pools;
wherein the three experience cache pools of each agent are the priority cache pool R_priority, the profit cache pool R_success and the loss cache pool R_failure;
S33, each agent in turn selects a behavior a_t, which modifies the current agent's state s_t and yields a corresponding return value r_t, thereby determining the corresponding experience (s_t, a_t, r_t, s_t+1);
S34, storing the experience (s_t, a_t, r_t, s_t+1) together with its priority Y into the corresponding experience cache pool of the agent, according to the return value r_t;
S35, extracting experiences, in a set proportion, from the agent's own experience cache pools and from those of the agent whose state is closest to its state s_t, to form a training set, and training the agent's DDPG network structure;
S36, clustering the trained agents by their trajectory similarity D_s to group the agents;
S37, performing an intra-group evolution operation on each group of agents obtained by the grouping, to obtain several new agent subgroups and form the trajectory set of the current urban road network;
S38, counting the traffic of each road section in the urban road network based on the trajectory set of the current urban road network, to obtain the traffic distribution prediction.
In step S32, the invention introduces an experience cache pool mechanism: while retaining the original priority cache pool R_priority, two new experience cache pools are created, the profit cache pool R_success and the loss cache pool R_failure. The profit cache pool stores experiences that obtain gains, and the loss cache pool stores experiences that receive penalties. While the priority of an experience is computed and the experience is put into the priority cache pool, it is also judged whether it is a gain experience or a penalty experience and stored into the corresponding experience cache pool.
The step S34 is specifically:
R1, storing the experience (s_t, a_t, r_t, s_t+1) into the priority cache pool R_priority, computing the priority Y of the experience (s_t, a_t, r_t, s_t+1), and storing the priority Y into the priority cache pool R_priority;
wherein the priority Y of the experience (s_t, a_t, r_t, s_t+1) is calculated as:
Y = A·z_t + B·δ_t
where A and B are error coefficients, z_t is the eligibility factor and δ_t is the TD error;
the eligibility factor z_t is determined from the discount coefficient γ, the trace-decay parameter λ and the return value obtained by taking behavior a_t in state s_t;
the TD error δ_t is:
δ_t = R_t + γ·Q'(s_t, argmax_a Q(s_t, a_t)) − Q(s_t-1, a_t-1)
where R_t is the reward obtained at time t and Q'(·) is the behavior selection function that maximizes the reward for the corresponding state;
R2, when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is positive, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the profit cache pool R_success;
when the return value r_t of the experience (s_t, a_t, r_t, s_t+1) is negative, storing the experience (s_t, a_t, r_t, s_t+1) with its priority Y into the loss cache pool R_failure.
In step S35, when an agent learns, a certain amount of experience is extracted from the three experience pools in a specific ratio; this removes the correlation between data, improves the stability of neural network training and reduces training oscillation. The extraction ratios from the priority cache pool, the profit cache pool and the loss cache pool are as follows:
where episode_n is the current round number and episode_max is the total number of rounds.
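Since the exact expression for the extraction ratio is not reproduced above, the sketch below simply assumes a linearly decaying α over the rounds and an even split of the remainder between the profit and loss pools; the pool contents are toy data:

```python
import random

def sample_minibatch(priority_pool, success_pool, failure_pool,
                     episode_n, episode_max, batch_size=32):
    """Mixed extraction in the spirit of step S35; the decay schedule of alpha
    and the even split of the remainder are assumptions for illustration."""
    alpha = max(0.2, 1.0 - episode_n / episode_max)   # assumed decay schedule
    k_priority = int(alpha * batch_size)
    k_rest = batch_size - k_priority
    batch = random.sample(priority_pool, min(k_priority, len(priority_pool)))
    batch += random.sample(success_pool, min(k_rest // 2, len(success_pool)))
    batch += random.sample(failure_pool, min(k_rest - k_rest // 2, len(failure_pool)))
    return batch

# Toy pools of (s_t, a_t, r_t, s_t+1) tuples, purely for illustration.
prio = [(i, 0, 0.0, i + 1) for i in range(100)]
gain = [(i, 1, 1.0, i + 1) for i in range(50)]
loss = [(i, 1, -1.0, i + 1) for i in range(50)]
print(len(sample_minibatch(prio, gain, loss, episode_n=10, episode_max=200)))
```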
α decreases as the number of rounds increases, so that the agent pays more attention to the priority cache pool in the early stage of training, while the experience in the profit cache reduces unnecessary blind exploration. When extracting experience from the cache pools of other agents, the agent selects, according to its own state, the agent whose state is closest to its own, so that the experience of other agents exploring near the state s is fully utilized and repeated exploration of the same environment is effectively avoided. Therefore, step S36 is specifically:
B1, calculating the trajectory similarity D_s between the trajectories generated by the agents through a dynamic time warping algorithm;
B2, taking the similarity D_s of any two trajectories as the inter-class distance;
specifically, when the similarity of two trajectories is calculated, every point in the trajectories is compared until the end point is reached, and all calculated distances are accumulated to obtain the trajectory similarity used in the invention; using the idea of dynamic programming, the warping with the minimum accumulated distance is found, and this accumulated distance is the path similarity between the two trajectories;
B3, dividing the agents whose inter-class distance is smaller than a set threshold into one group, thereby grouping the agents.
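A small sketch of steps B1-B3: a classic dynamic-time-warping accumulated distance as the trajectory similarity D_s, followed by threshold-based grouping. The greedy grouping rule and the threshold value are assumptions for illustration:

```python
import numpy as np

def dtw_distance(traj_a, traj_b):
    """Classic DTW accumulated distance between two trajectories of 2-D points,
    used here as the similarity D_s of step B1."""
    na, nb = len(traj_a), len(traj_b)
    D = np.full((na + 1, nb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            cost = np.linalg.norm(np.subtract(traj_a[i - 1], traj_b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[na, nb]

def group_agents(trajectories, threshold):
    # Agents whose pairwise DTW distance stays below the threshold share a group.
    groups = []
    for idx, traj in enumerate(trajectories):
        for group in groups:
            if all(dtw_distance(traj, trajectories[j]) < threshold for j in group):
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Toy agent trajectories (coordinates invented); the threshold is an assumption.
trajs = [[(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 0.1), (2, 0.1)], [(5, 5), (6, 6)]]
print(group_agents(trajs, threshold=1.0))   # -> [[0, 1], [2]]
```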
In step S37, the method of performing the intra-group evolution operation within each group of agents to obtain a new agent subgroup specifically includes:
C1, regarding all agents as a population pop, and encoding each agent;
C2, calculating the fitness of the agents in the population pop based on their encodings;
C3, initializing an empty population newpop, and adding the agent with the maximum fitness into newpop;
C4, selecting among the remaining agents in the population pop, adding the selected agents into the population newpop, and updating newpop;
C5, replacing the population pop with the updated population newpop, and decoding each agent in newpop to obtain the new agent subgroup.
In this process, a number of agents are put into the environment for training at the same time. At the end of each round, the DDPG models trained in the environment vary in quality; the elite-retention strategy keeps the models that perform well, the genetic evolution strategy applies crossover and mutation to the poorer models, and the result is put back into the environment as a new population for the next round of training.
First, the neural network weights of each agent are encoded as real numbers. In step C1, the method for encoding each agent in the population pop includes:
performing real-number encoding on the weights of the agent's neural network, in the form:
Agent = [w_11 ... w_x'y' ... w_xy, w_11 ... w_y'q' ... w_yq]
where w_x'y' is the weight from input-layer neuron x' to hidden-layer neuron y' in the agent's neural network, and w_y'q' is the weight from hidden-layer neuron y' to output-layer neuron q';
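A minimal sketch of this real-number encoding and the corresponding decoding, for a hypothetical network with x = 3 input, y = 4 hidden and q = 2 output neurons:

```python
import numpy as np

def encode_agent(w_in_hidden, w_hidden_out):
    """Real-number encoding of step C1: flatten the input->hidden and
    hidden->output weight matrices into one chromosome vector."""
    return np.concatenate([w_in_hidden.ravel(), w_hidden_out.ravel()])

def decode_agent(chromosome, shape_in_hidden, shape_hidden_out):
    split = shape_in_hidden[0] * shape_in_hidden[1]
    return (chromosome[:split].reshape(shape_in_hidden),
            chromosome[split:].reshape(shape_hidden_out))

# Hypothetical small network: 3 input, 4 hidden and 2 output neurons.
rng = np.random.default_rng(0)
w_xy = rng.normal(size=(3, 4))     # weights w_x'y'
w_yq = rng.normal(size=(4, 2))     # weights w_y'q'
agent = encode_agent(w_xy, w_yq)
print(agent.shape)                 # -> (20,)
w_xy2, w_yq2 = decode_agent(agent, (3, 4), (4, 2))
assert np.allclose(w_xy, w_xy2) and np.allclose(w_yq, w_yq2)
```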
In step C2, the genetic algorithm takes over the optimization of the agent's DDPG neural network and evaluates the quality of each individual by the fitness function, so that inferior individuals can be eliminated; the fitness f(b) of agent b is:
f(b) = R_b + q·Σδ_b
where R_b is the total reward of agent b over each round, Σδ_b is the total loss value of agent b, and q is the learning rate;
When the agents are selected, in order to prevent the optimal solution generated during evolution from being destroyed by crossover and mutation, the invention adopts an elite-retention strategy that copies the individual with the maximum fitness in each group to the next generation without any change. Step C4 is therefore specifically:
C41, selecting agents by roulette-wheel selection among the remaining agents in the population pop, with the selection probability of each agent proportional to its fitness;
C42, generating new agents from the selected agents according to the set crossover probability and mutation probability;
since the settings of the crossover probability P_c and the mutation probability P_m strongly influence the convergence of the genetic algorithm and can increase the error between the optimal solution and the true solution, the invention uses adaptively adjusted crossover and mutation probabilities to ensure the diversity of the population:
the crossover probability P_c is:
P_c = 1 − 0.5 / (1 + e^Δf)
and the mutation probability P_m is:
P_m = 1 − 0.05 / (1 + e^Δf)
where Δf is the difference between the maximum fitness value and the average fitness value in the population pop;
C43, adding the generated new agents into the population newpop until the number of agents in newpop reaches the set number, completing the update of newpop.
Since this embodiment uses real-number encoding, the crossover and mutation operations are the arithmetic crossover operation and the uniform mutation operation, respectively. Arithmetic crossover means that two agents generate two new agents by linear combination; assuming the agents Agent_a and Agent_b undergo arithmetic crossover, the two new individuals generated by the crossover operation are:
Agent'_a = P_c·Agent_a + (1 − P_c)·Agent_b
Agent'_b = P_c·Agent_b + (1 − P_c)·Agent_a
This embodiment adopts uniform mutation: for an agent encoded as above, if w_x'y' is a mutation point with value range [ω_min, ω_max], a random number is generated for the mutation point w_x'y' and compared with the mutation probability P_m; if it is smaller, a random value w'_x'y' is selected in [ω_min, ω_max] to replace w_x'y'.
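The genetic operators of steps C2-C43 can be sketched as follows; the fitness inputs, the weight range [ω_min, ω_max], the learning rate q and the positive shift used in the roulette wheel are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(total_reward, total_loss, q=0.01):
    # f(b) = R_b + q * sum(delta_b), as described for step C2.
    return total_reward + q * total_loss

def roulette_select(population, fits):
    # Roulette-wheel selection; fitness values are shifted to stay positive.
    p = np.asarray(fits, dtype=float) - min(fits) + 1e-6
    return population[rng.choice(len(population), p=p / p.sum())]

def adaptive_probs(fits):
    # Adaptive crossover/mutation probabilities driven by the spread delta_f.
    delta_f = max(fits) - float(np.mean(fits))
    return 1 - 0.5 / (1 + np.exp(delta_f)), 1 - 0.05 / (1 + np.exp(delta_f))

def arithmetic_crossover(agent_a, agent_b, p_c):
    child_a = p_c * agent_a + (1 - p_c) * agent_b
    child_b = p_c * agent_b + (1 - p_c) * agent_a
    return child_a, child_b

def uniform_mutation(agent, p_m, w_min=-1.0, w_max=1.0):
    # Genes whose random draw falls below p_m are replaced by a random value
    # from [w_min, w_max], following the uniform mutation described above.
    mutated = agent.copy()
    mask = rng.random(agent.shape) < p_m
    mutated[mask] = rng.uniform(w_min, w_max, size=mask.sum())
    return mutated

# Toy population of encoded agents (chromosomes invented for illustration).
pop = [rng.normal(size=6) for _ in range(4)]
fits = [fitness(r, l) for r, l in [(10, -2), (7, -1), (12, -4), (5, -0.5)]]
p_c, p_m = adaptive_probs(fits)
a, b = roulette_select(pop, fits), roulette_select(pop, fits)
child_a, child_b = arithmetic_crossover(a, b, p_c)
print(np.round(uniform_mutation(child_a, p_m), 3))
```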
Example 2:
This embodiment provides a traffic simulation platform under variable traffic control schemes, built on the above traffic distribution prediction method.
The platform can predict regional traffic distribution under different traffic control schemes on the basis of the real urban road network and checkpoint (bayonet) data. The framework of the traffic simulation platform under a variable traffic control scheme is shown in figure 2; the platform has three sub-modules: a road network module, a traffic module and an algorithm module. The road network module models the urban road network within a specified range of a specified city; the traffic module mines driving trajectory data from the checkpoint data of a certain period within the specified range of the city; the algorithm module is divided into a traffic feature extraction module and a traffic simulation module, where the feature extraction module uses the improved maximum-entropy inverse reinforcement learning algorithm, on the basis of the road network module and the traffic module, to extract the traffic features of the area at a specific time, and the traffic simulation module takes the computed traffic features and the urban road network under the traffic control scheme as input and performs traffic simulation through forward reinforcement learning; finally the traffic is displayed in a visual form.
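A hedged sketch of how the three sub-modules could be composed in code; all class and function names are invented for illustration and do not describe the platform's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RoadNetworkModule:
    region: str
    def build(self) -> Dict[str, List[str]]:
        # Would model the urban road network inside the specified range.
        return {"v1": ["v2"], "v2": ["v3"], "v3": []}

@dataclass
class FlowModule:
    checkpoint_records: List[dict]
    def mine_trajectories(self) -> List[List[str]]:
        # Would mine driving trajectories from the checkpoint (bayonet) data.
        return [["v1", "v2", "v3"]]

@dataclass
class AlgorithmModule:
    extract_features: Callable   # improved maximum-entropy IRL
    simulate_flow: Callable      # forward reinforcement learning

def run_platform(network: RoadNetworkModule, flow: FlowModule,
                 algo: AlgorithmModule, control_scheme: dict) -> dict:
    graph = network.build()
    trajectories = flow.mine_trajectories()
    features = algo.extract_features(graph, trajectories)
    return algo.simulate_flow(graph, features, control_scheme)

result = run_platform(
    RoadNetworkModule("demo region"), FlowModule([]),
    AlgorithmModule(lambda g, t: {"feature": 1.0},
                    lambda g, f, c: {"v1->v2": 120}),
    control_scheme={"banned_roads": []})
print(result)   # -> {'v1->v2': 120}
```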
This embodiment simulates the traffic control scheme of the Mianyang Science and Technology Expo in September 2019 to predict the regional traffic under a traffic control scheme based on road section prohibition, and verifies the accuracy of regional traffic prediction for a traffic control scheme based on prohibited turns at intersections by simulating the turn-prohibition traffic control applied in the traffic operation of the CBD area of Mianyang city in December 2019.
(1) Road network selection explanation:
For the regional traffic prediction of the road-section-prohibition control scheme, the main urban area of Mianyang city was selected after consulting the Mianyang traffic police detachment and the smart city project staff of Mianyang city. The selected range is the rectangle enclosed by the four points of the Dragon Stream logistics park (104.717637, 31.513032), the Mianyang suburban airport (104.749689, 31.435669), the miller hub (104.603876, 31.450148) and the Youth Square (104.774194, 31.46093). As shown in fig. 3 (a, road network range; b, Mianyang road network), the regional road network was crawled and the Mianyang road network simplified; it contains 3200 road sections and 1016 intersections.
(2) Preparing data:
The experiment predicts the traffic distribution of the morning peak from 8:00 to 9:00 on 6 September 2019 during the Expo. From the checkpoint data of 1 September 2019 to 5 September 2019, 302 expert strategy sets for the morning peak were compiled; the smallest expert strategy set contains 43 trajectories, the largest contains 809, and the average trajectory length is 46.
(3) Traffic control scheme description:
During the Expo, the traffic police department of Mianyang city imposed road prohibitions on the following roads: (a) one-way traffic from the Flying Cloud Avenue winery intersection to the mountain road rubdown intersection; (b) one-way traffic from the Liaoning Avenue Li Hospital intersection to the old Yong'an Road Li Hospital intersection; (c) one-way traffic from the old Yong'an Road Xin automobile intersection to the Bubo Garden intersection, as shown in fig. 4.
(4) Simulation experiment process:
To verify the accuracy and reliability of the proposed algorithm, several traffic simulations were carried out for the Mianyang urban area during the Expo after the traffic control scheme was applied; the experimental results are shown in fig. 5. The prediction without road closure at 9:00 a.m. on the first day of the Expo shows that the core circle and the control circle are congested at 9:00 during the morning peak, and the thermodynamic diagram shows large red and orange congested areas around the Expo exhibition center; after the one-way road closure is applied, only a small number of road sections in the core and control areas show orange congestion, and most areas show light green, i.e. basically smooth traffic. In the diversion area, different road sections show different traffic states without road closure, and after the closure the road sections close to the control area show orange congestion.
As can be seen from the comparison of fig. 5(a) and fig. 5(b), the predicted flow distribution is similar to the actual flow distribution. Fig. 5(c) compares the predicted data of all one-way road sections in the road network with the actual data; the predicted and actual data show the same distribution trend across the road network. The 20 road sections with the largest numbers of passing vehicles are then taken out and observed in fig. 5(d); the data predicted by the algorithm are slightly larger than the actual data, with f(s) denoting the predicted flow and the corresponding measured value the actual flow. In summary, the accuracy of the regional traffic prediction experiment for the Mianyang urban area, after the platform applies the traffic control scheme used during the Mianyang Science and Technology Expo, is as follows: