
CN117610681A - Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning - Google Patents

Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning Download PDF

Info

Publication number
CN117610681A
CN117610681A (application number CN202311623676.1A)
Authority
CN
China
Prior art keywords
decision
reinforcement learning
action
learning
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311623676.1A
Other languages
Chinese (zh)
Inventor
裴晓飞 (Pei Xiaofei)
杨哲 (Yang Zhe)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202311623676.1A
Publication of CN117610681A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/084 — Backpropagation, e.g. using gradient descent
    • G06N 3/092 — Reinforcement learning
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Traffic Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning. Based on the DDQN reinforcement learning algorithm, neural networks are used to build reinforcement learning decision models for different driving scenarios, and a different reward function is designed for each scenario. The decision models are pre-trained by imitation learning on driving demonstration data from the different scenarios to obtain pre-trained models, which are then trained through online interaction based on the reward functions to obtain the final reinforcement learning decision models. At decision time, the final model corresponding to the current driving scenario is invoked; it selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action. The invention improves autonomous driving decision performance.

Description

Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning
Technical Field
The invention belongs to the technical field of autonomous driving, and particularly relates to a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning.
Background
Decision-making is a core research topic and key technology in intelligent transportation systems: it allows the vehicle to make timely judgments in complex environments in place of a human driver. A safe and effective decision model not only ensures driving safety and reduces the accident rate, but also relieves traffic pressure to some extent and improves traffic efficiency. Building a reliable and efficient decision model is therefore important. Most existing decision methods are rule-based; they are interpretable, easy to build, and mature, but they cannot adapt to all of the conditions a vehicle may encounter, so they can make erroneous decisions with serious consequences. In recent years, reinforcement-learning-based decision methods have shown adaptability to uncertain environments, but they require large amounts of training data to make decisions robust and performant.
Such training relies on data gathered by interacting with the environment, which is often time-consuming. Strongly interactive task scenarios in particular, such as on/off ramps and roundabouts, demand large amounts of high-quality interaction data and long training times; pure self-exploration by reinforcement learning is very inefficient, and the data it gathers are not guaranteed to be of high quality, which further limits the performance of the decision model.
Disclosure of Invention
The technical problem the invention aims to solve is: to provide a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning that improves autonomous driving decision performance.
The technical solution adopted by the invention to solve this problem is a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning, comprising the following steps:
S1, for the autonomous vehicle, construct reinforcement learning decision models for different driving scenarios with neural networks based on the DDQN reinforcement learning algorithm, and initialize them randomly; design a different reward function for each driving scenario;
S2, obtain driving demonstration data for the different driving scenarios, and pre-train the reinforcement learning decision models on these data by imitation learning to obtain pre-trained models;
S3, train the pre-trained models through online interaction based on the reward functions, further optimizing the reinforcement learning decision models until they fully converge, to obtain the final reinforcement learning decision models;
S4, at decision time, invoke the final reinforcement learning decision model corresponding to the current driving scenario, obtain the real-time traffic state of the autonomous vehicle and use it as the model input; the model selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action.
According to the above method, in step S2 the driving demonstration data for the different driving scenarios are obtained as follows:
A1, initialize the vehicle state in each driving scenario;
A2, in each decision period, a human expert judges the traffic state of the autonomous vehicle and makes the decision, with the planning and control module responsible for executing the decided action;
A3, compute the reward of the expert action according to the designed reward function;
A4, record the current state, action, reward, next state, and termination flag of the decision period as a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), which serves as the driving demonstration data for that scenario.
According to the above method, step S2 specifically comprises:
S2.1, formulate the loss function, comprising a temporal-difference loss, a supervised margin classification loss, and a parameter regularization loss, and tune the weights of the three;
S2.2, import the driving demonstration data for the different driving scenarios;
S2.3, draw a batch of samples, compute the Q values of the demonstrated actions from the Markov decision information chains, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model;
S2.4, repeat step S2.3 for a set number of decision periods of neural network training to achieve an initial fit of the demonstrated-action Q values, finally obtaining the pre-trained model.
According to the above method, the loss function J(Q) comprises the temporal-difference loss, the supervised margin classification loss, and the parameter regularization loss, as follows:
J(Q) = J_DQ(Q) + λ1·J_E(Q) + λ2·J_L2(Q)
where J_DQ(Q), J_E(Q), and J_L2(Q) denote the temporal-difference loss, the supervised margin classification loss, and the L2 parameter regularization loss, respectively; λ1 and λ2 are the weights of the supervised margin classification loss and the L2 regularization loss, L2 being the Euclidean norm of the network weights.
According to the above method, step S3 specifically comprises:
S3.1, import the driving demonstration data for the different driving scenarios into an expert-data experience replay pool;
S3.2, initialize the exploration coefficient of the greedy policy;
S3.3, initialize the environment and the vehicle state;
S3.4, based on the current vehicle state s_i, use the greedy policy to select the action a_i, choosing between the optimal action estimated by the pre-trained model and a random action;
S3.5, take action a_i in the current vehicle state s_i to obtain the new state s_{i+1};
S3.6, evaluate action a_i according to the designed reward function to obtain the reward r_i corresponding to the current vehicle state and action;
S3.7, store the current state, action, reward, next state, and termination flag of the online interaction of S3.4-S3.6 as interaction data in the form of a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), and deposit it in an interaction-data experience replay pool;
S3.8, sample from the expert-data and interaction-data experience replay pools in a set ratio to form a training batch, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model, the supervised margin classification loss not being computed for the interaction data;
S3.9, repeat steps S3.4 to S3.8, terminating on collision or when the time limit is reached, then start a new training round from S3.3, until the model converges to give the final reinforcement learning decision model.
According to the above method, step S4 specifically comprises:
S4.1, invoke the final reinforcement learning decision model corresponding to the current scenario;
S4.2, obtain the real-time traffic state of the autonomous vehicle and input it into the final reinforcement learning decision model;
S4.3, the final reinforcement learning decision model selects the optimal decision action from the action space;
S4.4, input the optimal decision action into the low-level planning and control module, which parses and executes it, and update the vehicle state;
S4.5, repeat steps S4.2 to S4.4 to complete the autonomous vehicle's decision-making.
According to the above method, the real-time traffic state of the autonomous vehicle comprises its longitudinal and lateral positions, its longitudinal and lateral speeds, and the longitudinal and lateral relative positions and speeds of surrounding vehicles within its perception range; the action space comprises acceleration actions and left/right lane-change actions.
According to the above method, in the low-level planning and control module the control period is taken as 1/10 of the decision period.
In the above method, the reward function comprises a collision risk assessment reward, a speed reward, a comfort reward, and a task reward.
A computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The beneficial effects of the invention are as follows:
1. The method pre-trains the reinforcement learning decision model by imitation learning on expert data, guiding the model's initialization and constraining its early exploration policy. This improves early exploration efficiency and convergence speed, addressing how time-consuming reinforcement learning is in strongly interactive scenarios. Initial verification shows that, compared with plain reinforcement learning, the proposed method converges considerably faster and to a better solution.
2. The method can make effective use of real-world natural driving data: the driving data are processed into Markov decision information chains, on which the reinforcement learning model is pre-trained, so existing data fully drive the decision model's pre-training and improve convergence speed. Driving data can also be collected for a specific scenario to drive targeted pre-training, solving the problem of effectively exploiting high-quality natural driving data.
3. The invention builds the decision model with DDQN, whose discrete output actions provide decision guidance to the low-level planning and control module. The decision model is thus easy to combine with mature low-level planning and control, and receives execution feedback from the low level through state acquisition. This ensures the model's applicability and decision safety, and avoids the low applicability and low safety of end-to-end reinforcement learning methods that make one large leap from the perception end to the execution end.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the invention.
FIG. 2 is a training curve of average reward compared with the prior art in an embodiment of the invention.
FIG. 3 is a training curve of average speed compared with the prior art in an embodiment of the invention.
FIG. 4 compares test results of the decision models with the prior art in an embodiment of the invention.
Detailed Description
The invention will be further described with reference to specific examples and figures.
The invention provides a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning, comprising the following steps:
S1, for the autonomous vehicle, construct reinforcement learning decision models for different driving scenarios with neural networks based on the DDQN reinforcement learning algorithm, and initialize them randomly; design a different reward function for each driving scenario.
Specifically, reinforcement learning decision models are built for different driving scenarios, which may include car following, curve driving, on/off ramps, and the like. A different reward function, comprising a collision risk assessment reward, a speed reward, a comfort reward, and a task reward, is designed for each scenario: car following emphasizes the collision risk assessment and speed rewards, curve driving emphasizes the comfort reward, and on/off ramps emphasize the task reward.
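As a concrete illustration, the following is a minimal sketch of what one per-scenario DDQN decision network could look like; the state dimension, layer sizes, action count, and scenario names are illustrative assumptions rather than values fixed by the method.

```python
# Minimal sketch of a per-scenario DDQN decision network (PyTorch).
# State dimension, layer sizes, and action count are assumptions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int = 26, n_actions: int = 7, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# S1: one randomly initialized online/target pair per driving scenario.
scenarios = ["car_following", "curve", "ramp"]
models = {name: (QNetwork(), QNetwork()) for name in scenarios}
for online, target in models.values():
    target.load_state_dict(online.state_dict())  # sync target at initialization
```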
S2, obtain driving demonstration data for the different driving scenarios, and pre-train the reinforcement learning decision model on these data by imitation learning to obtain the pre-trained model.
The driving demonstration data for the different driving scenarios are obtained as follows:
A1, initialize the vehicle state in each driving scenario;
A2, in each decision period, a human expert judges the traffic state of the autonomous vehicle and makes the decision, with the planning and control module responsible for executing the decided action;
A3, compute the reward of the expert action according to the designed reward function;
A4, record the current state s_i, action a_i, reward r_i, next state s_{i+1}, and termination flag d_i of the decision period as a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), which serves as the driving demonstration data for that scenario. Here d_i indicates whether the i-th decision period terminated.
In this embodiment, the driving demonstration data across the different driving scenarios comprise 200,000 Markov decision information chains.
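The collection procedure A1-A4 can be sketched as the loop below; the environment interface, expert policy, and reward function are placeholders standing in for the simulator, the human expert, and the scenario-specific reward design.

```python
# Hedged sketch of demonstration collection (A1–A4); env/expert/reward APIs are assumed.
from collections import namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next", "done"])

def collect_demonstrations(env, expert_policy, reward_fn, n_transitions=200_000):
    demos = []
    while len(demos) < n_transitions:
        s = env.reset()                          # A1: initialize the vehicle state
        done = False
        while not done and len(demos) < n_transitions:
            a = expert_policy(s)                 # A2: human expert decides
            s_next, done = env.step(a)           # planning/control module executes
            r = reward_fn(s, a, s_next)          # A3: reward of the expert action
            demos.append(Transition(s, a, r, s_next, done))  # A4: (s_i, a_i, r_i, s_{i+1}, d_i)
            s = s_next
    return demos
```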
Step S2 specifically comprises the following steps:
S2.1, formulate the loss function, comprising a temporal-difference loss, a supervised margin classification loss, and a parameter regularization loss, and tune the weights of the three. The loss function J(Q) is:
J(Q) = J_DQ(Q) + λ1·J_E(Q) + λ2·J_L2(Q)
where J_DQ(Q), J_E(Q), and J_L2(Q) denote the temporal-difference loss, the supervised margin classification loss, and the L2 parameter regularization loss, respectively. λ1 and λ2 are the weights of the supervised margin classification loss and the L2 regularization loss, set to 1.0 and 1×10⁻⁶ respectively. L2 is the Euclidean norm of the network weights.
S2.2, import the driving demonstration data for the different driving scenarios.
S2.3, draw a batch of samples, compute the Q values of the demonstrated actions from the Markov decision information chains, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model.
S2.4, repeat step S2.3 for a set number of decision periods of neural network training to achieve an initial fit of the demonstrated-action Q values, finally obtaining the pre-trained model. In this embodiment, 1,000,000 steps of neural network training are performed. A code sketch of the combined loss and this pre-training loop follows.
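The sketch below shows how the combined loss of S2.1 and the pre-training loop of S2.2-S2.4 could be implemented, in the style of deep Q-learning from demonstrations (DQfD); the margin value, the target-network sync interval, and the replay-pool API are assumptions not specified in the text.

```python
# Sketch of J(Q) = J_DQ + λ1·J_E + λ2·J_L2 and the pre-training loop (assumptions noted).
import torch
import torch.nn.functional as F

LAMBDA1, LAMBDA2 = 1.0, 1e-6   # weights from the embodiment
MARGIN = 0.8                   # assumed margin hyperparameter

def combined_loss(online, target, batch, gamma=0.99, is_expert=True):
    s, a, r, s_next, done = batch                 # tensors built from decision chains
    q = online(s)                                 # (B, n_actions)
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)

    # J_DQ: double-DQN temporal-difference loss (online net selects, target net evaluates)
    with torch.no_grad():
        a_star = online(s_next).argmax(dim=1, keepdim=True)
        q_next = target(s_next).gather(1, a_star).squeeze(1)
        td_target = r + gamma * (1.0 - done) * q_next
    j_dq = F.mse_loss(q_sa, td_target)

    # J_E: supervised margin classification loss, only for expert data (see S3.8);
    # pushes non-demonstrated actions at least MARGIN below the demonstrated one
    j_e = q_sa.new_zeros(())
    if is_expert:
        margin = torch.full_like(q, MARGIN)
        margin.scatter_(1, a.unsqueeze(1), 0.0)   # zero margin on the expert action
        j_e = ((q + margin).max(dim=1).values - q_sa).mean()

    # J_L2: L2 regularization over the network weights
    j_l2 = sum((p ** 2).sum() for p in online.parameters())
    return j_dq + LAMBDA1 * j_e + LAMBDA2 * j_l2

def pretrain(online, target, demo_pool, optimizer, steps=1_000_000, sync_every=10_000):
    for step in range(steps):                     # S2.2–S2.4
        loss = combined_loss(online, target, demo_pool.sample_batch())
        optimizer.zero_grad()
        loss.backward()                           # gradient backpropagation
        optimizer.step()
        if step % sync_every == 0:                # assumed periodic target sync
            target.load_state_dict(online.state_dict())
```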
S3, train the pre-trained model through online interaction based on the reward function, further optimizing the reinforcement learning decision model until it fully converges, to obtain the final reinforcement learning decision model. Specifically:
S3.1, import the driving demonstration data for the different driving scenarios into an expert-data experience replay pool;
S3.2, initialize the exploration coefficient of the greedy policy, taken as 0.5 in this embodiment;
S3.3, initialize the environment and the vehicle state;
s3.4 based on the current state S of the vehicle i Selecting action a between optimal action estimated by pre-training model and random action by greedy strategy i
S3.5 at the current state S of the vehicle i Take action a i Then a new state s is obtained i+1
S3.6, according to the set rewarding function, the action a is paired i Evaluating to obtain rewards r corresponding to the current state and the action of the vehicle i
S3.7, the data of the current state, action, rewards, next state and termination of the online interaction process formed by S3.4-S3.6 are used as a Markov decision information chain (S i ,a i ,r i ,s i+1 ,d i ) The form of the interactive data is stored as interactive data, and the interactive data is stored in an interactive data experience playback pool;
s3.8, sampling according to a certain proportion in a home data experience playback pool and an interactive data experience playback pool to form a training batch, calculating batch loss by using the loss function, performing gradient back propagation, and optimizing parameters of the reinforcement learning decision model, wherein the interactive data does not calculate supervision marginal classification loss;
s3.9, repeating the steps from S3.4 to S3.8, and stopping when collision occurs or the stopping time is reached, and starting new round training from S3.3 until the model converges after stopping to obtain the final reinforcement learning decision model.
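One online-training episode (S3.3-S3.9) can be sketched as follows, reusing combined_loss from the pre-training sketch above; the replay-pool API, the collate helper, and the expert/interaction sampling ratio are assumptions.

```python
# Sketch of one online interaction episode with epsilon-greedy exploration and
# mixed expert/interaction replay (S3.3–S3.9); pool/env APIs are assumed.
import random
import torch

def run_episode(env, online, target, expert_pool, interact_pool, optimizer,
                reward_fn, epsilon=0.5, batch_size=64, expert_frac=0.25):
    s = env.reset()                                      # S3.3
    done = False
    while not done:
        if random.random() < epsilon:                    # S3.4: epsilon-greedy selection
            a = env.sample_random_action()
        else:
            with torch.no_grad():
                a = online(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s_next, done = env.step(a)                       # S3.5
        r = reward_fn(s, a, s_next)                      # S3.6
        interact_pool.add((s, a, r, s_next, done))       # S3.7

        n_exp = int(batch_size * expert_frac)            # S3.8: mixed training batch
        loss = (combined_loss(online, target,
                              collate(expert_pool.sample(n_exp)), is_expert=True)
                + combined_loss(online, target,
                                collate(interact_pool.sample(batch_size - n_exp)),
                                is_expert=False))        # no margin loss for interaction data
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        s = s_next                                       # S3.9: loop until crash/timeout
```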
Before the decision model selects an action, the candidate action space is screened with basic safety rules to guarantee that the selected action is safe. The safety rules use time-to-collision and time headway as safety indicators to check each action, and lane-change commands are additionally checked against the road boundary.
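A minimal sketch of this rule-based screening is given below; the thresholds, the one-step prediction helper, and the fall-back to maximum braking are assumptions, not values from the text.

```python
# Hedged sketch of safety screening of the discrete action space before selection.
TTC_MIN = 3.0   # s, assumed time-to-collision threshold
THW_MIN = 1.0   # s, assumed time-headway threshold

def safe_actions(state, actions, predict):
    """Keep only actions whose predicted outcome passes TTC/THW and boundary checks."""
    allowed = []
    for a in actions:
        nxt = predict(state, a)                   # assumed one-step kinematic prediction
        if nxt.ttc < TTC_MIN or nxt.thw < THW_MIN:
            continue                              # collision risk too high
        if a.is_lane_change and not nxt.inside_road_boundary:
            continue                              # lane change would cross the road boundary
        allowed.append(a)
    # assumed fall-back: if nothing passes, brake as hard as possible
    return allowed or [min(actions, key=lambda a: a.accel)]
```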
S4, at decision time, invoke the final reinforcement learning decision model corresponding to the current driving scenario, obtain the real-time traffic state of the autonomous vehicle and use it as the model input; the model selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action. Specifically:
S4.1, invoke the final reinforcement learning decision model corresponding to the current scenario.
S4.2, obtain the real-time traffic state of the autonomous vehicle and input it into the final reinforcement learning decision model. The real-time traffic state comprises the vehicle's longitudinal and lateral positions, its longitudinal and lateral speeds, and the longitudinal and lateral relative positions and speeds of surrounding vehicles within its perception range. The action space comprises acceleration actions (taking the five accelerations -2 m/s², -1 m/s², 0 m/s², 1 m/s², and 2 m/s² as an example) and left/right lane-change actions.
S4.3, the final reinforcement learning decision model selects the optimal decision action from the action space.
S4.4, input the optimal decision action into the low-level planning and control module, which parses and executes it, and update the vehicle state. In the low-level planning and control module, the control period is taken as 1/10 of the decision period.
S4.5, repeat steps S4.2 to S4.4 to complete the autonomous vehicle's decision-making. A sketch of this deployment loop follows.
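The deployment loop S4.1-S4.5 can be sketched as below; the environment and controller interfaces and the scenario classifier are assumptions, while the 10 control cycles per decision cycle follow the 1/10 period ratio above.

```python
# Sketch of the deployment loop (S4.1–S4.5); env/controller/scenario APIs are assumed.
import torch

def drive(env, models, controller, scenario_of, n_control_per_decision=10):
    s = env.get_state()
    while not env.task_finished():
        online = models[scenario_of(s)]                  # S4.1: pick the scenario model
        with torch.no_grad():                            # S4.2–S4.3: best discrete action
            a = online(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        for _ in range(n_control_per_decision):          # S4.4: low-level execution,
            controller.step(a)                           # control period = decision period / 10
        s = env.get_state()                              # S4.5: repeat until the task ends
```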
To verify the proposed method preliminarily, a simulation test was run on an off-ramp scenario: a 400 m stretch of a four-lane elevated road (Town04 in the CARLA simulation platform, lane width 3.5 m). The autonomous vehicle is spawned in the two leftmost lanes 300 m before the off-ramp, and its target point is the rightmost off-ramp exit, so it must change lanes rightward step by step while interacting with traffic in the rightmost lanes; surrounding vehicles are spawned randomly within 200 m ahead of and behind the autonomous vehicle. The decision step and the control step are 500 ms and 50 ms, respectively. The simulation comprises the following steps:
(1) Step 1: design the reward function of the off-ramp scenario, comprising a collision risk assessment reward, a speed reward, comfort-related rewards, and task-target rewards, the latter including a lateral-deviation reward and a task-completion reward.
(2) Step 2: a human makes the decisions at the vehicle-control end of the simulation platform based on the current state of the autonomous vehicle; the decision data are collected to build the expert dataset.
(3) Step 3: construct the reinforcement learning decision model and pre-train it on the expert dataset by imitation learning to obtain the pre-trained model.
(4) Step 4: load the pre-trained model and run online interaction in the simulation platform to further optimize the reinforcement learning decision model until it fully converges.
(5) Step 5: load the converged decision model, feed it the acquired vehicle state, and test it in the scenario.
Specifically, the rewards of the simulation verification scenario in this embodiment are designed as follows:
(1) Collision risk assessment reward:
R_collision = -200 if a collision occurs
where TTC and THW denote the time-to-collision and the time headway, respectively.
(2) Speed reward:
R_v = -2 × max(v_des - v_ego, 0) / v_des
where v_ego denotes the current speed (m/s) of the autonomous vehicle and v_des its desired speed (m/s), taken here as 60 km/h based on the average vehicle speed under the traffic flow.
(3) Comfort reward:
where a denotes the acceleration selected by the decision model of the autonomous vehicle.
(4) Task reward:
R_y = -2 × |y_current - y_target| / dev_max
where y_current denotes the current lateral position of the autonomous vehicle, y_target the lateral position of the target point, and dev_max the maximum lateral deviation.
The final reward is normalized as:
R_total = (R_TTC + R_THW + R_v + R_a + R_y + R_over) / 10
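Putting the terms above together, the normalized scenario reward could be computed as in the sketch below. The TTC, THW, comfort, and timeout terms (R_TTC, R_THW, R_a, R_over) are left as assumed helpers because their exact expressions are not reproduced here, and dev_max is an assumed value (three lane widths of 3.5 m).

```python
# Hedged sketch of the normalized off-ramp reward; helper terms are assumptions.
def total_reward(ego, env, v_des=60 / 3.6, dev_max=10.5):
    if env.collision:
        return -200.0                                   # collision reward dominates (assumption)
    r_ttc = env.ttc_reward()                            # assumed helper for the TTC term
    r_thw = env.thw_reward()                            # assumed helper for the THW term
    r_v = -2.0 * max(v_des - ego.speed, 0.0) / v_des    # speed reward (v_des = 60 km/h)
    r_a = env.comfort_reward()                          # assumed helper for the comfort term
    r_y = -2.0 * abs(ego.y - env.y_target) / dev_max    # task (lateral deviation) reward
    r_over = env.timeout_reward()                       # assumed helper for the overtime term
    return (r_ttc + r_thw + r_v + r_a + r_y + r_over) / 10.0
```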
based on the above steps, simulation verification is performed on the above scenes, and result statistics is performed, and fig. 2 to fig. 4 are graphs of initial verification results of an automatic driving automobile decision method based on simulated learning and discrete reinforcement learning in the embodiment of the invention, where the statistical results include average rewards and average speeds in training stages, and success rates, collision rates and average speeds in testing stages. Fig. 2-4 show that, compared with the single reinforcement learning method, the automatic driving automobile decision method based on the simulation learning and reinforcement learning in the embodiment of the invention has larger improvement in convergence speed and convergence effect, the early exploration rate in the model training period is improved, the average speed of the automobile is higher, and meanwhile, the performance of the automatic driving automobile decision method in the embodiment of the invention in the test period in the scene is greatly improved compared with other methods, so that initial verification is provided for the effect of the automatic driving automobile decision method in the embodiment of the invention. After the initial pre-training of the simulated learning and the reinforcement learning is added, the decision model can be well initialized and the exploration strategy is restrained, the early exploration efficiency of the reinforcement learning model can be further improved, the convergence speed and the convergence effect can be greatly improved, and meanwhile, the combination capability of the decision model and the underlying regulation module is ensured through the discrete action guidance of the discrete reinforcement learning output.
The invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not in any way limit the implementation of the embodiments of the present application.
It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.

Claims (10)

1. An automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning, characterized in that it comprises the following steps:
S1, for the autonomous vehicle, construct reinforcement learning decision models for different driving scenarios with neural networks based on the DDQN reinforcement learning algorithm, and initialize them randomly; design a different reward function for each driving scenario;
S2, obtain driving demonstration data for the different driving scenarios, and pre-train the reinforcement learning decision models on these data by imitation learning to obtain pre-trained models;
S3, train the pre-trained models through online interaction based on the reward functions, further optimizing the reinforcement learning decision models until they fully converge, to obtain the final reinforcement learning decision models;
S4, at decision time, invoke the final reinforcement learning decision model corresponding to the current driving scenario, obtain the real-time traffic state of the autonomous vehicle and use it as the model input; the model selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action.
2. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein in step S2 the driving demonstration data for the different driving scenarios are obtained as follows:
A1, initialize the vehicle state in each driving scenario;
A2, in each decision period, a human expert judges the traffic state of the autonomous vehicle and makes the decision, with the planning and control module responsible for executing the decided action;
A3, compute the reward of the expert action according to the designed reward function;
A4, record the current state, action, reward, next state, and termination flag of the decision period as a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), which serves as the driving demonstration data for that scenario.
3. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein step S2 specifically comprises:
S2.1, formulate the loss function, comprising a temporal-difference loss, a supervised margin classification loss, and a parameter regularization loss, and tune the weights of the three;
S2.2, import the driving demonstration data for the different driving scenarios;
S2.3, draw a batch of samples, compute the Q values of the demonstrated actions from the Markov decision information chains, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model;
S2.4, repeat step S2.3 for a set number of decision periods of neural network training to achieve an initial fit of the demonstrated-action Q values, finally obtaining the pre-trained model.
4. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 3, wherein the loss function J(Q) comprises the temporal-difference loss, the supervised margin classification loss, and the parameter regularization loss, as follows:
J(Q) = J_DQ(Q) + λ1·J_E(Q) + λ2·J_L2(Q)
where J_DQ(Q), J_E(Q), and J_L2(Q) denote the temporal-difference loss, the supervised margin classification loss, and the L2 parameter regularization loss, respectively; λ1 and λ2 are the weights of the supervised margin classification loss and the L2 regularization loss, L2 being the Euclidean norm of the network weights.
5. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 3 or 4, wherein step S3 specifically comprises:
S3.1, import the driving demonstration data for the different driving scenarios into an expert-data experience replay pool;
S3.2, initialize the exploration coefficient of the greedy policy;
S3.3, initialize the environment and the vehicle state;
S3.4, based on the current vehicle state s_i, use the greedy policy to select the action a_i, choosing between the optimal action estimated by the pre-trained model and a random action;
S3.5, take action a_i in the current vehicle state s_i to obtain the new state s_{i+1};
S3.6, evaluate action a_i according to the designed reward function to obtain the reward r_i corresponding to the current vehicle state and action;
S3.7, store the current state, action, reward, next state, and termination flag of the online interaction of S3.4-S3.6 as interaction data in the form of a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), and deposit it in an interaction-data experience replay pool;
S3.8, sample from the expert-data and interaction-data experience replay pools in a set ratio to form a training batch, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model, the supervised margin classification loss not being computed for the interaction data;
S3.9, repeat steps S3.4 to S3.8, terminating on collision or when the time limit is reached, then start a new training round from S3.3, until the model converges to give the final reinforcement learning decision model.
6. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein step S4 specifically comprises:
S4.1, invoke the final reinforcement learning decision model corresponding to the current scenario;
S4.2, obtain the real-time traffic state of the autonomous vehicle and input it into the final reinforcement learning decision model;
S4.3, the final reinforcement learning decision model selects the optimal decision action from the action space;
S4.4, input the optimal decision action into the low-level planning and control module, which parses and executes it, and update the vehicle state;
S4.5, repeat steps S4.2 to S4.4 to complete the autonomous vehicle's decision-making.
7. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 6, wherein the real-time traffic state of the autonomous vehicle comprises its longitudinal and lateral positions, its longitudinal and lateral speeds, and the longitudinal and lateral relative positions and speeds of surrounding vehicles within its perception range; and the action space comprises acceleration actions and left/right lane-change actions.
8. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 6 or 7, wherein in the low-level planning and control module the control period is taken as 1/10 of the decision period.
9. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein the reward function comprises a collision risk assessment reward, a speed reward, a comfort reward, and a task reward.
10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.
CN202311623676.1A (priority date 2023-11-28, filing date 2023-11-28) — Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning — Pending — CN117610681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311623676.1A CN117610681A (en) 2023-11-28 2023-11-28 Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311623676.1A CN117610681A (en) 2023-11-28 2023-11-28 Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning

Publications (1)

Publication Number Publication Date
CN117610681A 2024-02-27

Family

ID=89947898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311623676.1A Pending CN117610681A (en) 2023-11-28 2023-11-28 Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning

Country Status (1)

Country Link
CN (1) CN117610681A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning
CN118113044A (en) * 2024-02-29 2024-05-31 中兵智能创新研究院有限公司 Cross-scene behavior decision system of ground unmanned platform
CN118656308A (en) * 2024-08-19 2024-09-17 中汽数据(天津)有限公司 Method for expanding safety test data of expected functions of Internet of vehicles

Similar Documents

Publication Publication Date Title
CN117610681A (en) Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning
CN110297494B (en) Decision-making method and system for lane change of automatic driving vehicle based on rolling game
CN109733415B (en) Anthropomorphic automatic driving and following model based on deep reinforcement learning
CN112201069B (en) Deep reinforcement learning-based method for constructing longitudinal following behavior model of driver
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN113561986B (en) Automatic driving automobile decision making method and device
CN112172813B (en) Car following system and method for simulating driving style based on deep inverse reinforcement learning
CN113253739B (en) Driving behavior decision method for expressway
CN110525428B (en) Automatic parking method based on fuzzy depth reinforcement learning
CN114162146B (en) Driving strategy model training method and automatic driving control method
CN114358128A (en) Method for training end-to-end automatic driving strategy
CN113901718A (en) Deep reinforcement learning-based driving collision avoidance optimization method in following state
CN111348034B (en) Automatic parking method and system based on generation countermeasure simulation learning
CN114148349B (en) Vehicle personalized following control method based on generation of countermeasure imitation study
CN114372501A (en) Automatic driving training method, device, equipment, storage medium and program product
CN117872800A (en) Decision planning method based on reinforcement learning in discrete state space
CN116639124A (en) Automatic driving vehicle lane changing method based on double-layer deep reinforcement learning
CN116224996A (en) Automatic driving optimization control method based on countermeasure reinforcement learning
CN116052411A (en) Diversion area mixed traffic flow control method based on graph neural network reinforcement learning
CN116027788A (en) Intelligent driving behavior decision method and equipment integrating complex network theory and part of observable Markov decision process
CN114789729A (en) Lane cooperative control system and method based on driving style
CN114954498A (en) Reinforced learning lane change behavior planning method and system based on simulated learning initialization
CN118560530B (en) Multi-agent driving behavior modeling method based on generation of countermeasure imitation learning
CN118567372B (en) Unmanned aerial vehicle control method and system based on multi-expert simulation learning
CN117975190B (en) Method and device for processing simulated learning mixed sample based on vision pre-training model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination