
CN117610681A - Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning - Google Patents

Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning Download PDF

Info

Publication number
CN117610681A
CN117610681A (application number CN202311623676.1A)
Authority
CN
China
Prior art keywords
decision
reinforcement learning
action
learning
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311623676.1A
Other languages
Chinese (zh)
Inventor
裴晓飞 (Pei Xiaofei)
杨哲 (Yang Zhe)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202311623676.1A
Publication of CN117610681A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/084 — Backpropagation, e.g. using gradient descent
    • G06N 3/092 — Reinforcement learning
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Traffic Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning. Based on the DDQN reinforcement learning algorithm, neural networks are used to build reinforcement learning decision models for different driving scenarios, and a different reward function is designed for each scenario. The decision models are pre-trained by imitation learning on driving demonstration data from the different scenarios to obtain pre-trained models, which are then trained through online interaction based on the reward functions to obtain the final reinforcement learning decision models. At decision time, the final model corresponding to the current driving scenario is invoked; it selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action. The invention improves autonomous driving decision performance.

Description

Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning
Technical Field
The invention belongs to the technical field of autonomous driving, and particularly relates to a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning.
Background
Decision-making is a core research topic and key technology in intelligent transportation systems: it allows the vehicle to make timely judgments in complex environments in place of a human driver. A safe and effective decision model not only ensures driving safety and reduces the accident rate, but also relieves traffic pressure to some extent and improves traffic efficiency. Building a reliable and efficient decision model is therefore important. Most existing decision methods are rule-based; they are interpretable, easy to build, and mature, but they cannot adapt to all of the conditions a vehicle may encounter, so they can make erroneous decisions with serious consequences. In recent years, reinforcement-learning-based decision methods have shown adaptability to uncertain environments, but they require large amounts of training data to make decisions robust and performant.
Such training relies on data gathered by interacting with the environment, which is often time-consuming. Strongly interactive task scenarios in particular, such as on/off ramps and roundabouts, demand large amounts of high-quality interaction data and long training times; pure self-exploration by reinforcement learning is very inefficient, and the data it gathers are not guaranteed to be of high quality, which further limits the performance of the decision model.
Disclosure of Invention
The technical problem the invention aims to solve is: to provide a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning that improves autonomous driving decision performance.
The technical solution adopted by the invention to solve this problem is a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning, comprising the following steps:
S1, for the autonomous vehicle, construct reinforcement learning decision models for different driving scenarios with neural networks based on the DDQN reinforcement learning algorithm, and initialize them randomly; design a different reward function for each driving scenario;
S2, obtain driving demonstration data for the different driving scenarios, and pre-train the reinforcement learning decision models on these data by imitation learning to obtain pre-trained models;
S3, train the pre-trained models through online interaction based on the reward functions, further optimizing the reinforcement learning decision models until they fully converge, to obtain the final reinforcement learning decision models;
S4, at decision time, invoke the final reinforcement learning decision model corresponding to the current driving scenario, obtain the real-time traffic state of the autonomous vehicle and use it as the model input; the model selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action.
According to the above method, in step S2 the driving demonstration data for the different driving scenarios are obtained as follows:
A1, initialize the vehicle state in each driving scenario;
A2, in each decision period, a human expert judges the traffic state of the autonomous vehicle and makes the decision, with the planning and control module responsible for executing the decided action;
A3, compute the reward of the expert action according to the designed reward function;
A4, record the current state, action, reward, next state, and termination flag of the decision period as a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), which serves as the driving demonstration data for that scenario.
According to the above method, step S2 specifically comprises:
S2.1, formulate the loss function, comprising a temporal-difference loss, a supervised margin classification loss, and a parameter regularization loss, and tune the weights of the three;
S2.2, import the driving demonstration data for the different driving scenarios;
S2.3, draw a batch of samples, compute the Q values of the demonstrated actions from the Markov decision information chains, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model;
S2.4, repeat step S2.3 for a set number of decision periods of neural network training to achieve an initial fit of the demonstrated-action Q values, finally obtaining the pre-trained model.
According to the above method, the loss function J(Q) comprises the temporal-difference loss, the supervised margin classification loss, and the parameter regularization loss, as follows:
J(Q) = J_DQ(Q) + λ1·J_E(Q) + λ2·J_L2(Q)
where J_DQ(Q), J_E(Q), and J_L2(Q) denote the temporal-difference loss, the supervised margin classification loss, and the L2 parameter regularization loss, respectively; λ1 and λ2 are the weights of the supervised margin classification loss and the L2 regularization loss, L2 being the Euclidean norm of the network weights.
According to the above method, step S3 specifically comprises:
S3.1, import the driving demonstration data for the different driving scenarios into an expert-data experience replay pool;
S3.2, initialize the exploration coefficient of the greedy policy;
S3.3, initialize the environment and the vehicle state;
S3.4, based on the current vehicle state s_i, use the greedy policy to select the action a_i, choosing between the optimal action estimated by the pre-trained model and a random action;
S3.5, take action a_i in the current vehicle state s_i to obtain the new state s_{i+1};
S3.6, evaluate action a_i according to the designed reward function to obtain the reward r_i corresponding to the current vehicle state and action;
S3.7, store the current state, action, reward, next state, and termination flag of the online interaction of S3.4-S3.6 as interaction data in the form of a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), and deposit it in an interaction-data experience replay pool;
S3.8, sample from the expert-data and interaction-data experience replay pools in a set ratio to form a training batch, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model, the supervised margin classification loss not being computed for the interaction data;
S3.9, repeat steps S3.4 to S3.8, terminating on collision or when the time limit is reached, then start a new training round from S3.3, until the model converges to give the final reinforcement learning decision model.
According to the above method, step S4 specifically comprises:
S4.1, invoke the final reinforcement learning decision model corresponding to the current scenario;
S4.2, obtain the real-time traffic state of the autonomous vehicle and input it into the final reinforcement learning decision model;
S4.3, the final reinforcement learning decision model selects the optimal decision action from the action space;
S4.4, input the optimal decision action into the low-level planning and control module, which parses and executes it, and update the vehicle state;
S4.5, repeat steps S4.2 to S4.4 to complete the autonomous vehicle's decision-making.
According to the above method, the real-time traffic state of the autonomous vehicle comprises its longitudinal and lateral positions, its longitudinal and lateral speeds, and the longitudinal and lateral relative positions and speeds of surrounding vehicles within its perception range; the action space comprises acceleration actions and left/right lane-change actions.
According to the above method, in the low-level planning and control module the control period is taken as 1/10 of the decision period.
In the above method, the reward function comprises a collision risk assessment reward, a speed reward, a comfort reward, and a task reward.
A computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The beneficial effects of the invention are as follows:
1. The method pre-trains the reinforcement learning decision model by imitation learning on expert data, guiding the model's initialization and constraining its early exploration policy. This improves early exploration efficiency and convergence speed, addressing how time-consuming reinforcement learning is in strongly interactive scenarios. Initial verification shows that, compared with plain reinforcement learning, the proposed method converges considerably faster and to a better solution.
2. The method can make effective use of real-world natural driving data: the driving data are processed into Markov decision information chains, on which the reinforcement learning model is pre-trained, so existing data fully drive the decision model's pre-training and improve convergence speed. Driving data can also be collected for a specific scenario to drive targeted pre-training, solving the problem of effectively exploiting high-quality natural driving data.
3. The invention builds the decision model with DDQN, whose discrete output actions provide decision guidance to the low-level planning and control module. The decision model is thus easy to combine with mature low-level planning and control, and receives execution feedback from the low level through state acquisition. This ensures the model's applicability and decision safety, and avoids the low applicability and low safety of end-to-end reinforcement learning methods that make one large leap from the perception end to the execution end.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the invention.
FIG. 2 is a training curve of average reward compared with the prior art in an embodiment of the invention.
FIG. 3 is a training curve of average speed compared with the prior art in an embodiment of the invention.
FIG. 4 compares test results of the decision models with the prior art in an embodiment of the invention.
Detailed Description
The invention will be further described with reference to specific examples and figures.
The invention provides a decision-making method for autonomous vehicles based on imitation learning and discrete reinforcement learning, comprising the following steps:
S1, for the autonomous vehicle, construct reinforcement learning decision models for different driving scenarios with neural networks based on the DDQN reinforcement learning algorithm, and initialize them randomly; design a different reward function for each driving scenario.
Specifically, reinforcement learning decision models are built for different driving scenarios, which may include car following, curve driving, on/off ramps, and the like. A different reward function, comprising a collision risk assessment reward, a speed reward, a comfort reward, and a task reward, is designed for each scenario: car following emphasizes the collision risk assessment and speed rewards, curve driving emphasizes the comfort reward, and on/off ramps emphasize the task reward.
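As a concrete illustration, the following is a minimal sketch of what one per-scenario DDQN decision network could look like; the state dimension, layer sizes, action count, and scenario names are illustrative assumptions rather than values fixed by the method.

```python
# Minimal sketch of a per-scenario DDQN decision network (PyTorch).
# State dimension, layer sizes, and action count are assumptions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int = 26, n_actions: int = 7, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q value per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# S1: one randomly initialized online/target pair per driving scenario.
scenarios = ["car_following", "curve", "ramp"]
models = {name: (QNetwork(), QNetwork()) for name in scenarios}
for online, target in models.values():
    target.load_state_dict(online.state_dict())  # sync target at initialization
```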
S2, obtain driving demonstration data for the different driving scenarios, and pre-train the reinforcement learning decision model on these data by imitation learning to obtain the pre-trained model.
The driving demonstration data for the different driving scenarios are obtained as follows:
A1, initialize the vehicle state in each driving scenario;
A2, in each decision period, a human expert judges the traffic state of the autonomous vehicle and makes the decision, with the planning and control module responsible for executing the decided action;
A3, compute the reward of the expert action according to the designed reward function;
A4, record the current state s_i, action a_i, reward r_i, next state s_{i+1}, and termination flag d_i of the decision period as a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), which serves as the driving demonstration data for that scenario. Here d_i indicates whether the i-th decision period terminated.
In this embodiment, the driving demonstration data across the different driving scenarios comprise 200,000 Markov decision information chains.
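The collection procedure A1-A4 can be sketched as the loop below; the environment interface, expert policy, and reward function are placeholders standing in for the simulator, the human expert, and the scenario-specific reward design.

```python
# Hedged sketch of demonstration collection (A1–A4); env/expert/reward APIs are assumed.
from collections import namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next", "done"])

def collect_demonstrations(env, expert_policy, reward_fn, n_transitions=200_000):
    demos = []
    while len(demos) < n_transitions:
        s = env.reset()                          # A1: initialize the vehicle state
        done = False
        while not done and len(demos) < n_transitions:
            a = expert_policy(s)                 # A2: human expert decides
            s_next, done = env.step(a)           # planning/control module executes
            r = reward_fn(s, a, s_next)          # A3: reward of the expert action
            demos.append(Transition(s, a, r, s_next, done))  # A4: (s_i, a_i, r_i, s_{i+1}, d_i)
            s = s_next
    return demos
```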
Step S2 specifically comprises the following steps:
S2.1, formulate the loss function, comprising a temporal-difference loss, a supervised margin classification loss, and a parameter regularization loss, and tune the weights of the three. The loss function J(Q) is:
J(Q) = J_DQ(Q) + λ1·J_E(Q) + λ2·J_L2(Q)
where J_DQ(Q), J_E(Q), and J_L2(Q) denote the temporal-difference loss, the supervised margin classification loss, and the L2 parameter regularization loss, respectively. λ1 and λ2 are the weights of the supervised margin classification loss and the L2 regularization loss, set to 1.0 and 1×10⁻⁶ respectively. L2 is the Euclidean norm of the network weights.
S2.2, import the driving demonstration data for the different driving scenarios.
S2.3, draw a batch of samples, compute the Q values of the demonstrated actions from the Markov decision information chains, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model.
S2.4, repeat step S2.3 for a set number of decision periods of neural network training to achieve an initial fit of the demonstrated-action Q values, finally obtaining the pre-trained model. In this embodiment, 1,000,000 steps of neural network training are performed. A code sketch of the combined loss and this pre-training loop follows.
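The sketch below shows how the combined loss of S2.1 and the pre-training loop of S2.2-S2.4 could be implemented, in the style of deep Q-learning from demonstrations (DQfD); the margin value, the target-network sync interval, and the replay-pool API are assumptions not specified in the text.

```python
# Sketch of J(Q) = J_DQ + λ1·J_E + λ2·J_L2 and the pre-training loop (assumptions noted).
import torch
import torch.nn.functional as F

LAMBDA1, LAMBDA2 = 1.0, 1e-6   # weights from the embodiment
MARGIN = 0.8                   # assumed margin hyperparameter

def combined_loss(online, target, batch, gamma=0.99, is_expert=True):
    s, a, r, s_next, done = batch                 # tensors built from decision chains
    q = online(s)                                 # (B, n_actions)
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)

    # J_DQ: double-DQN temporal-difference loss (online net selects, target net evaluates)
    with torch.no_grad():
        a_star = online(s_next).argmax(dim=1, keepdim=True)
        q_next = target(s_next).gather(1, a_star).squeeze(1)
        td_target = r + gamma * (1.0 - done) * q_next
    j_dq = F.mse_loss(q_sa, td_target)

    # J_E: supervised margin classification loss, only for expert data (see S3.8);
    # pushes non-demonstrated actions at least MARGIN below the demonstrated one
    j_e = q_sa.new_zeros(())
    if is_expert:
        margin = torch.full_like(q, MARGIN)
        margin.scatter_(1, a.unsqueeze(1), 0.0)   # zero margin on the expert action
        j_e = ((q + margin).max(dim=1).values - q_sa).mean()

    # J_L2: L2 regularization over the network weights
    j_l2 = sum((p ** 2).sum() for p in online.parameters())
    return j_dq + LAMBDA1 * j_e + LAMBDA2 * j_l2

def pretrain(online, target, demo_pool, optimizer, steps=1_000_000, sync_every=10_000):
    for step in range(steps):                     # S2.2–S2.4
        loss = combined_loss(online, target, demo_pool.sample_batch())
        optimizer.zero_grad()
        loss.backward()                           # gradient backpropagation
        optimizer.step()
        if step % sync_every == 0:                # assumed periodic target sync
            target.load_state_dict(online.state_dict())
```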
S3, train the pre-trained model through online interaction based on the reward function, further optimizing the reinforcement learning decision model until it fully converges, to obtain the final reinforcement learning decision model. Specifically:
S3.1, import the driving demonstration data for the different driving scenarios into an expert-data experience replay pool;
S3.2, initialize the exploration coefficient of the greedy policy, taken as 0.5 in this embodiment;
S3.3, initialize the environment and the vehicle state;
s3.4 based on the current state S of the vehicle i Selecting action a between optimal action estimated by pre-training model and random action by greedy strategy i
S3.5 at the current state S of the vehicle i Take action a i Then a new state s is obtained i+1
S3.6, according to the set rewarding function, the action a is paired i Evaluating to obtain rewards r corresponding to the current state and the action of the vehicle i
S3.7, the data of the current state, action, rewards, next state and termination of the online interaction process formed by S3.4-S3.6 are used as a Markov decision information chain (S i ,a i ,r i ,s i+1 ,d i ) The form of the interactive data is stored as interactive data, and the interactive data is stored in an interactive data experience playback pool;
s3.8, sampling according to a certain proportion in a home data experience playback pool and an interactive data experience playback pool to form a training batch, calculating batch loss by using the loss function, performing gradient back propagation, and optimizing parameters of the reinforcement learning decision model, wherein the interactive data does not calculate supervision marginal classification loss;
s3.9, repeating the steps from S3.4 to S3.8, and stopping when collision occurs or the stopping time is reached, and starting new round training from S3.3 until the model converges after stopping to obtain the final reinforcement learning decision model.
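One online-training episode (S3.3-S3.9) can be sketched as follows, reusing combined_loss from the pre-training sketch above; the replay-pool API, the collate helper, and the expert/interaction sampling ratio are assumptions.

```python
# Sketch of one online interaction episode with epsilon-greedy exploration and
# mixed expert/interaction replay (S3.3–S3.9); pool/env APIs are assumed.
import random
import torch

def run_episode(env, online, target, expert_pool, interact_pool, optimizer,
                reward_fn, epsilon=0.5, batch_size=64, expert_frac=0.25):
    s = env.reset()                                      # S3.3
    done = False
    while not done:
        if random.random() < epsilon:                    # S3.4: epsilon-greedy selection
            a = env.sample_random_action()
        else:
            with torch.no_grad():
                a = online(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s_next, done = env.step(a)                       # S3.5
        r = reward_fn(s, a, s_next)                      # S3.6
        interact_pool.add((s, a, r, s_next, done))       # S3.7

        n_exp = int(batch_size * expert_frac)            # S3.8: mixed training batch
        loss = (combined_loss(online, target,
                              collate(expert_pool.sample(n_exp)), is_expert=True)
                + combined_loss(online, target,
                                collate(interact_pool.sample(batch_size - n_exp)),
                                is_expert=False))        # no margin loss for interaction data
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        s = s_next                                       # S3.9: loop until crash/timeout
```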
Before the decision model selects an action, the candidate action space is screened with basic safety rules to guarantee that the selected action is safe. The safety rules use time-to-collision and time headway as safety indicators to check each action, and lane-change commands are additionally checked against the road boundary.
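A minimal sketch of this rule-based screening is given below; the thresholds, the one-step prediction helper, and the fall-back to maximum braking are assumptions, not values from the text.

```python
# Hedged sketch of safety screening of the discrete action space before selection.
TTC_MIN = 3.0   # s, assumed time-to-collision threshold
THW_MIN = 1.0   # s, assumed time-headway threshold

def safe_actions(state, actions, predict):
    """Keep only actions whose predicted outcome passes TTC/THW and boundary checks."""
    allowed = []
    for a in actions:
        nxt = predict(state, a)                   # assumed one-step kinematic prediction
        if nxt.ttc < TTC_MIN or nxt.thw < THW_MIN:
            continue                              # collision risk too high
        if a.is_lane_change and not nxt.inside_road_boundary:
            continue                              # lane change would cross the road boundary
        allowed.append(a)
    # assumed fall-back: if nothing passes, brake as hard as possible
    return allowed or [min(actions, key=lambda a: a.accel)]
```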
S4, at decision time, invoke the final reinforcement learning decision model corresponding to the current driving scenario, obtain the real-time traffic state of the autonomous vehicle and use it as the model input; the model selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action. Specifically:
S4.1, invoke the final reinforcement learning decision model corresponding to the current scenario.
S4.2, obtain the real-time traffic state of the autonomous vehicle and input it into the final reinforcement learning decision model. The real-time traffic state comprises the vehicle's longitudinal and lateral positions, its longitudinal and lateral speeds, and the longitudinal and lateral relative positions and speeds of surrounding vehicles within its perception range. The action space comprises acceleration actions (taking the five accelerations -2 m/s², -1 m/s², 0 m/s², 1 m/s², and 2 m/s² as an example) and left/right lane-change actions.
S4.3, the final reinforcement learning decision model selects the optimal decision action from the action space.
S4.4, input the optimal decision action into the low-level planning and control module, which parses and executes it, and update the vehicle state. In the low-level planning and control module, the control period is taken as 1/10 of the decision period.
S4.5, repeat steps S4.2 to S4.4 to complete the autonomous vehicle's decision-making. A sketch of this deployment loop follows.
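The deployment loop S4.1-S4.5 can be sketched as below; the environment and controller interfaces and the scenario classifier are assumptions, while the 10 control cycles per decision cycle follow the 1/10 period ratio above.

```python
# Sketch of the deployment loop (S4.1–S4.5); env/controller/scenario APIs are assumed.
import torch

def drive(env, models, controller, scenario_of, n_control_per_decision=10):
    s = env.get_state()
    while not env.task_finished():
        online = models[scenario_of(s)]                  # S4.1: pick the scenario model
        with torch.no_grad():                            # S4.2–S4.3: best discrete action
            a = online(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        for _ in range(n_control_per_decision):          # S4.4: low-level execution,
            controller.step(a)                           # control period = decision period / 10
        s = env.get_state()                              # S4.5: repeat until the task ends
```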
To verify the proposed method preliminarily, a simulation test was run on an off-ramp scenario: a 400 m stretch of a four-lane elevated road (Town04 in the CARLA simulation platform, lane width 3.5 m). The autonomous vehicle is spawned in the two leftmost lanes 300 m before the off-ramp, and its target point is the rightmost off-ramp exit, so it must change lanes rightward step by step while interacting with traffic in the rightmost lanes; surrounding vehicles are spawned randomly within 200 m ahead of and behind the autonomous vehicle. The decision step and the control step are 500 ms and 50 ms, respectively. The simulation comprises the following steps:
(1) Step 1: design the reward function of the off-ramp scenario, comprising a collision risk assessment reward, a speed reward, comfort-related rewards, and task-target rewards, the latter including a lateral-deviation reward and a task-completion reward.
(2) Step 2: a human makes the decisions at the vehicle-control end of the simulation platform based on the current state of the autonomous vehicle; the decision data are collected to build the expert dataset.
(3) Step 3: construct the reinforcement learning decision model and pre-train it on the expert dataset by imitation learning to obtain the pre-trained model.
(4) Step 4: load the pre-trained model and run online interaction in the simulation platform to further optimize the reinforcement learning decision model until it fully converges.
(5) Step 5: load the converged decision model, feed it the acquired vehicle state, and test it in the scenario.
Specifically, the rewards of the simulation verification scenario in this embodiment are designed as follows:
(1) Collision risk assessment reward:
R_collision = -200 if a collision occurs
where TTC and THW denote the time-to-collision and the time headway, respectively.
(2) Speed reward:
R_v = -2 × max(v_des - v_ego, 0) / v_des
where v_ego denotes the current speed (m/s) of the autonomous vehicle and v_des its desired speed (m/s), taken here as 60 km/h based on the average vehicle speed under the traffic flow.
(3) Comfort reward:
where a denotes the acceleration selected by the decision model of the autonomous vehicle.
(4) Task reward:
R_y = -2 × |y_current - y_target| / dev_max
where y_current denotes the current lateral position of the autonomous vehicle, y_target the lateral position of the target point, and dev_max the maximum lateral deviation.
The final reward is normalized as:
R_total = (R_TTC + R_THW + R_v + R_a + R_y + R_over) / 10
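Putting the terms above together, the normalized scenario reward could be computed as in the sketch below. The TTC, THW, comfort, and timeout terms (R_TTC, R_THW, R_a, R_over) are left as assumed helpers because their exact expressions are not reproduced here, and dev_max is an assumed value (three lane widths of 3.5 m).

```python
# Hedged sketch of the normalized off-ramp reward; helper terms are assumptions.
def total_reward(ego, env, v_des=60 / 3.6, dev_max=10.5):
    if env.collision:
        return -200.0                                   # collision reward dominates (assumption)
    r_ttc = env.ttc_reward()                            # assumed helper for the TTC term
    r_thw = env.thw_reward()                            # assumed helper for the THW term
    r_v = -2.0 * max(v_des - ego.speed, 0.0) / v_des    # speed reward (v_des = 60 km/h)
    r_a = env.comfort_reward()                          # assumed helper for the comfort term
    r_y = -2.0 * abs(ego.y - env.y_target) / dev_max    # task (lateral deviation) reward
    r_over = env.timeout_reward()                       # assumed helper for the overtime term
    return (r_ttc + r_thw + r_v + r_a + r_y + r_over) / 10.0
```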
based on the above steps, simulation verification is performed on the above scenes, and result statistics is performed, and fig. 2 to fig. 4 are graphs of initial verification results of an automatic driving automobile decision method based on simulated learning and discrete reinforcement learning in the embodiment of the invention, where the statistical results include average rewards and average speeds in training stages, and success rates, collision rates and average speeds in testing stages. Fig. 2-4 show that, compared with the single reinforcement learning method, the automatic driving automobile decision method based on the simulation learning and reinforcement learning in the embodiment of the invention has larger improvement in convergence speed and convergence effect, the early exploration rate in the model training period is improved, the average speed of the automobile is higher, and meanwhile, the performance of the automatic driving automobile decision method in the embodiment of the invention in the test period in the scene is greatly improved compared with other methods, so that initial verification is provided for the effect of the automatic driving automobile decision method in the embodiment of the invention. After the initial pre-training of the simulated learning and the reinforcement learning is added, the decision model can be well initialized and the exploration strategy is restrained, the early exploration efficiency of the reinforcement learning model can be further improved, the convergence speed and the convergence effect can be greatly improved, and meanwhile, the combination capability of the decision model and the underlying regulation module is ensured through the discrete action guidance of the discrete reinforcement learning output.
The invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not in any way limit the implementation of the embodiments of the present application.
It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.

Claims (10)

1. An automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning, characterized in that it comprises the following steps:
S1, for the autonomous vehicle, construct reinforcement learning decision models for different driving scenarios with neural networks based on the DDQN reinforcement learning algorithm, and initialize them randomly; design a different reward function for each driving scenario;
S2, obtain driving demonstration data for the different driving scenarios, and pre-train the reinforcement learning decision models on these data by imitation learning to obtain pre-trained models;
S3, train the pre-trained models through online interaction based on the reward functions, further optimizing the reinforcement learning decision models until they fully converge, to obtain the final reinforcement learning decision models;
S4, at decision time, invoke the final reinforcement learning decision model corresponding to the current driving scenario, obtain the real-time traffic state of the autonomous vehicle and use it as the model input; the model selects the optimal decision action from the action space and passes it to the low-level planning and control module, which is responsible for executing the action.
2. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein in step S2 the driving demonstration data for the different driving scenarios are obtained as follows:
A1, initialize the vehicle state in each driving scenario;
A2, in each decision period, a human expert judges the traffic state of the autonomous vehicle and makes the decision, with the planning and control module responsible for executing the decided action;
A3, compute the reward of the expert action according to the designed reward function;
A4, record the current state, action, reward, next state, and termination flag of the decision period as a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), which serves as the driving demonstration data for that scenario.
3. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein step S2 specifically comprises:
S2.1, formulate the loss function, comprising a temporal-difference loss, a supervised margin classification loss, and a parameter regularization loss, and tune the weights of the three;
S2.2, import the driving demonstration data for the different driving scenarios;
S2.3, draw a batch of samples, compute the Q values of the demonstrated actions from the Markov decision information chains, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model;
S2.4, repeat step S2.3 for a set number of decision periods of neural network training to achieve an initial fit of the demonstrated-action Q values, finally obtaining the pre-trained model.
4. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 3, wherein the loss function J(Q) comprises the temporal-difference loss, the supervised margin classification loss, and the parameter regularization loss, as follows:
J(Q) = J_DQ(Q) + λ1·J_E(Q) + λ2·J_L2(Q)
where J_DQ(Q), J_E(Q), and J_L2(Q) denote the temporal-difference loss, the supervised margin classification loss, and the L2 parameter regularization loss, respectively; λ1 and λ2 are the weights of the supervised margin classification loss and the L2 regularization loss, L2 being the Euclidean norm of the network weights.
5. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 3 or 4, wherein step S3 specifically comprises:
S3.1, import the driving demonstration data for the different driving scenarios into an expert-data experience replay pool;
S3.2, initialize the exploration coefficient of the greedy policy;
S3.3, initialize the environment and the vehicle state;
S3.4, based on the current vehicle state s_i, use the greedy policy to select the action a_i, choosing between the optimal action estimated by the pre-trained model and a random action;
S3.5, take action a_i in the current vehicle state s_i to obtain the new state s_{i+1};
S3.6, evaluate action a_i according to the designed reward function to obtain the reward r_i corresponding to the current vehicle state and action;
S3.7, store the current state, action, reward, next state, and termination flag of the online interaction of S3.4-S3.6 as interaction data in the form of a Markov decision information chain (s_i, a_i, r_i, s_{i+1}, d_i), and deposit it in an interaction-data experience replay pool;
S3.8, sample from the expert-data and interaction-data experience replay pools in a set ratio to form a training batch, compute the batch loss with the loss function, backpropagate gradients, and optimize the parameters of the reinforcement learning decision model, the supervised margin classification loss not being computed for the interaction data;
S3.9, repeat steps S3.4 to S3.8, terminating on collision or when the time limit is reached, then start a new training round from S3.3, until the model converges to give the final reinforcement learning decision model.
6. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein step S4 specifically comprises:
S4.1, invoke the final reinforcement learning decision model corresponding to the current scenario;
S4.2, obtain the real-time traffic state of the autonomous vehicle and input it into the final reinforcement learning decision model;
S4.3, the final reinforcement learning decision model selects the optimal decision action from the action space;
S4.4, input the optimal decision action into the low-level planning and control module, which parses and executes it, and update the vehicle state;
S4.5, repeat steps S4.2 to S4.4 to complete the autonomous vehicle's decision-making.
7. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 6, wherein the real-time traffic state of the autonomous vehicle comprises its longitudinal and lateral positions, its longitudinal and lateral speeds, and the longitudinal and lateral relative positions and speeds of surrounding vehicles within its perception range; and the action space comprises acceleration actions and left/right lane-change actions.
8. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 6 or 7, wherein in the low-level planning and control module the control period is taken as 1/10 of the decision period.
9. The automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning of claim 1, wherein the reward function comprises a collision risk assessment reward, a speed reward, a comfort reward, and a task reward.
10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.
CN202311623676.1A (priority date 2023-11-28, filing date 2023-11-28) — Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning — Pending — CN117610681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311623676.1A CN117610681A (en) 2023-11-28 2023-11-28 Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311623676.1A CN117610681A (en) 2023-11-28 2023-11-28 Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning

Publications (1)

Publication Number Publication Date
CN117610681A 2024-02-27

Family

ID=89947898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311623676.1A Pending CN117610681A (en) 2023-11-28 2023-11-28 Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning

Country Status (1)

Country Link
CN (1) CN117610681A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning
CN118113044A (en) * 2024-02-29 2024-05-31 中兵智能创新研究院有限公司 Cross-scene behavior decision system of ground unmanned platform
CN118656308A (en) * 2024-08-19 2024-09-17 中汽数据(天津)有限公司 Method for expanding safety test data of expected functions of Internet of vehicles

Similar Documents

Publication Publication Date Title
CN117610681A (en) Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning
CN110297494B (en) Decision-making method and system for lane change of automatic driving vehicle based on rolling game
CN109733415B (en) Anthropomorphic automatic driving and following model based on deep reinforcement learning
CN112201069B (en) Deep reinforcement learning-based method for constructing longitudinal following behavior model of driver
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN113561986B (en) Automatic driving automobile decision making method and device
CN112172813B (en) Car following system and method for simulating driving style based on deep inverse reinforcement learning
CN113253739B (en) Driving behavior decision method for expressway
CN110525428B (en) Automatic parking method based on fuzzy depth reinforcement learning
CN114162146B (en) Driving strategy model training method and automatic driving control method
CN114358128A (en) Method for training end-to-end automatic driving strategy
CN113901718A (en) Deep reinforcement learning-based driving collision avoidance optimization method in following state
CN111348034B (en) Automatic parking method and system based on generation countermeasure simulation learning
CN114148349B (en) Vehicle personalized following control method based on generation of countermeasure imitation study
CN114372501A (en) Automatic driving training method, device, equipment, storage medium and program product
CN117872800A (en) Decision planning method based on reinforcement learning in discrete state space
CN116639124A (en) Automatic driving vehicle lane changing method based on double-layer deep reinforcement learning
CN116224996A (en) Automatic driving optimization control method based on countermeasure reinforcement learning
CN116052411A (en) Diversion area mixed traffic flow control method based on graph neural network reinforcement learning
CN116027788A (en) Intelligent driving behavior decision method and equipment integrating complex network theory and part of observable Markov decision process
CN114789729A (en) Lane cooperative control system and method based on driving style
CN114954498A (en) Reinforced learning lane change behavior planning method and system based on simulated learning initialization
CN118560530B (en) Multi-agent driving behavior modeling method based on generation of countermeasure imitation learning
CN118567372B (en) Unmanned aerial vehicle control method and system based on multi-expert simulation learning
CN117975190B (en) Method and device for processing simulated learning mixed sample based on vision pre-training model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination