CN114372501A - Automatic driving training method, device, equipment, storage medium and program product - Google Patents


Info

Publication number
CN114372501A
CN114372501A
Authority
CN
China
Prior art keywords
training
sample
expert
driving
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111437745.0A
Other languages
Chinese (zh)
Inventor
詹仙园
李键雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111437745.0A priority Critical patent/CN114372501A/en
Publication of CN114372501A publication Critical patent/CN114372501A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B9/00 Simulators for teaching or training purposes
    • G09B9/02 Simulators for teaching or training purposes for teaching control of vehicles or other craft
    • G09B9/04 Simulators for teaching or training purposes for teaching control of vehicles or other craft for teaching control of land vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Feedback Control In General (AREA)

Abstract

The present invention relates to the field of automatic driving technologies, and in particular to an automatic driving training method, apparatus, device, storage medium, and program product. The method comprises the following steps: acquiring a training sample set required for automatic driving training, the training sample set comprising at least one expert sample and at least one non-expert sample; determining expert-like samples among the non-expert samples by means of a preset scorer and raising the training weight of the expert-like samples in a driving simulation strategy, wherein the scorer is used to score each training sample in the training sample set and the training results of the expert-like samples lie within a preset error of the training results of the expert samples; and performing simulation training on the autonomous vehicle with the training samples according to the weight-adjusted driving simulation strategy. The method addresses the poor safety and low efficiency of prior-art training of autonomous vehicles and achieves safe, efficient training of the autonomous vehicle.

Description

Automatic driving training method, device, equipment, storage medium and program product
Technical Field
The present invention relates to the field of automatic driving technologies, and in particular, to an automatic driving training method, apparatus, device, storage medium, and program product.
Background
With the development of science and technology, automatic driving has gradually come into public view. To ensure the safety of automatic driving, the training process of the autonomous vehicle is very important. In the prior art, training an autonomous vehicle requires a large number of training samples, including historical driving data from experienced drivers with low accident rates, historical driving data from inexperienced drivers with high accident rates, and historical driving data from automatic driving systems that required manual takeover. However, because the driving data of inexperienced, accident-prone drivers accounts for a large share of the training samples, using these samples for simulation training creates serious safety hazards during training and a high probability of driving accidents; at the same time, the training time must be extended to secure the final training effect, so the training efficiency of automatic driving is low.
Disclosure of Invention
The invention provides an automatic driving training method, an automatic driving training device, automatic driving training equipment, a storage medium and a program product, which are used for solving the problems of poor safety and low efficiency in the prior art when an automatic driving vehicle is trained and realizing the safe and efficient training of the automatic driving vehicle.
The invention provides an automatic driving training method, which comprises the following steps: acquiring a training sample set required for automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining expert-like samples among the non-expert samples through a preset scorer, and raising the training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used to score each training sample in the training sample set, and the training results of the expert-like samples and of the expert samples are within a preset error; and performing simulation training on the autonomous vehicle with the training samples according to the weight-adjusted driving simulation strategy.
According to the automatic driving training method provided by the invention, the scorer comprises a distribution function of the driving simulation strategy; the determining, by a preset scorer, the expert-like samples in the non-expert samples includes: inputting each of the training samples into the scorer; adjusting the adjustable parameters of the scorer according to the conditional probability distribution values generated by the distribution function and respectively corresponding to each training sample; determining the expert-like samples in the non-expert samples through the adjusted scorer.
According to the automatic driving training method provided by the invention, the determining the expert-like sample in the non-expert sample through the adjusted scorer comprises the following steps: inputting each training sample into the adjusted scorer respectively to obtain a scoring result corresponding to each training sample; and determining the non-expert sample with the scoring result larger than a preset scoring threshold value as the class expert sample.
According to the automatic driving training method provided by the invention, the driving simulation strategy comprises the adjusted scorer; the method for improving the training weight of the expert-like sample in the driving simulation strategy comprises the following steps: and according to the scoring result corresponding to each training sample, improving the training weight of the expert-like sample in the driving simulation strategy, wherein the larger the scoring result corresponding to the training sample is, the larger the training weight corresponding to the training sample is.
According to the automatic driving training method provided by the present invention, after performing the simulation training on the automatic driving vehicle according to the driving simulation strategy with the adjusted weight by using the training sample, the method further includes: acquiring training times of the automatic driving vehicle; when the training times are determined to be smaller than a time threshold value, acquiring a new training sample set; adjusting the driving simulation strategy again through the new training sample set; and performing simulation training on the automatic driving vehicle again according to the driving simulation strategy after the adjustment again through a new training sample set.
According to the automatic driving training method provided by the invention, each training sample comprises a corresponding sample driving action and a sample driving state; the simulation training of the automatic driving vehicle is carried out according to the driving simulation strategy after the weight is adjusted through the training sample, and the simulation training comprises the following steps: performing the following processing on each training sample: inputting the sample driving action to the weighted driving simulation strategy; acquiring an actual driving state of the automatic driving vehicle after automatic driving according to the driving simulation strategy; and comparing the actual running state with the sample running state, and when the comparison result does not accord with a preset state error, adjusting the internal parameters of the driving simulation strategy until the comparison result accords with the preset state error.
The present invention also provides an automatic driving training apparatus, comprising: the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a training sample set required by automatic driving training, and the training sample set comprises at least one expert sample and at least one non-expert sample; the adjusting module is used for determining expert-like samples in the non-expert samples through a preset scorer and improving training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and training results of the expert-like samples and training results of the expert samples are within a preset error; and the training module is used for performing simulation training on the automatic driving vehicle according to the driving simulation strategy after the weight is adjusted through the training sample.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the steps of any of the above described automatic driving training methods when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the automated driving training method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the automatic driving training method as described in any one of the above.
The automatic driving training method, apparatus, device, storage medium, and program product provided by the invention acquire a training sample set required for automatic driving training, the set containing at least one expert sample and at least one non-expert sample. Expert-like samples among the non-expert samples are then determined by a preset scorer, and the training weight of the expert-like samples in the driving simulation strategy is raised, the training results of the expert-like samples lying within a preset error of the training results of the expert samples. Finally, the autonomous vehicle is given simulation training with the training samples according to the weight-adjusted driving simulation strategy. In other words, for the expert-like samples in the non-expert samples, i.e., those whose training effect is good, the above process raises their training weight; because the training results of the expert-like samples and the expert samples lie within the preset error, this is equivalent to increasing the number of expert samples, which improves safety during training of the autonomous vehicle and raises training efficiency at the same time. In addition, the process identifies and treats the expert-like samples within the non-expert samples separately, so the non-expert samples are used more fully than when all non-expert samples share the same training strategy, further improving the training effect of the autonomous vehicle.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow diagram of an automated driving training method provided by the present invention;
FIG. 2 is a diagram illustrating an exemplary training sample set provided by the present invention;
FIG. 3 is a second schematic flow chart of the automatic driving training method provided by the present invention;
FIG. 4 is a schematic view of the structural connection of the automatic driving training device provided by the present invention;
FIG. 5 is a schematic structural connection diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The inventors analyzed the prior art. In the prior art, when training an automatically driven vehicle, a suitable training and learning method must be selected according to the actual situation. The main learning methods in the prior art include imitation learning, reinforcement learning, inverse reinforcement learning, and offline reinforcement learning.
Imitation Learning is a method of learning expert strategies within a supervised learning framework. The basic idea is to record the state o_t observed by an expert and the action a_t the expert subsequently takes, and then to fit, with a neural network or the like, a mapping from the state o_t to the action a_t. When the agent later observes the state o_t, it can then produce an action similar to a_t; this imitation of the expert strategy is called imitation learning.
Reinforcement Learning, one of the research hotspots of machine learning, requires no labeled input-output samples and instead tends to learn the strategy that is optimal for completing the task. The basic idea is that an agent receives reward stimuli r while continually interacting with the environment and, step by step, forms a prediction Q of the cumulative reward; based on the current state, it can then take the action a* that attains the highest value:

a* = argmax_a Q    (1)

where argmax_a selects the action that maximizes Q. The agent thereby learns the strategy π* that obtains the maximum reward:

π* = argmax_π E[ Σ_{t=0}^{N} γ^t · r_t ]    (2)

where t denotes time, π denotes a strategy, γ is the discount factor, r_t is the reward at time t, and N is a value greater than 0.
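For intuition, a minimal sketch of these two quantities, assuming a simple tabular Q function:

import numpy as np

def greedy_action(q_values):
    # Formula (1): a* is the action with the highest predicted cumulative reward Q.
    return int(np.argmax(q_values))

def discounted_return(rewards, gamma=0.9):
    # Formula (2): the discounted sum of rewards, gamma^t * r_t for t = 0..N.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(greedy_action(np.array([0.1, 0.7, 0.3])))  # -> 1
print(discounted_return([1.0, 0.0, 1.0]))        # -> 1.0 + 0.81 = 1.81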
Inverse reinforcement learning is a branch of reinforcement learning in which a reward function r_t = R(o_t, a_t) is learned from expert demonstrations; the learned reward function then guides reinforcement-learning training, which alleviates, to a certain extent, the difficulty of designing a reward function for reinforcement learning.
Offline Reinforcement Learning and Online Reinforcement Learning are the two major branches of reinforcement learning, and both are data-driven decision methods. In contrast to online reinforcement learning, offline reinforcement learning requires no online interaction between the agent and the environment at all; instead, the optimal strategy is learned from a historical dataset D that records the agent's state-action-reward-next-state transitions {o, a, r, o'}, so that the strategy obtains the maximum cumulative reward. The strategy π* that obtains the maximum reward is then:

π* = argmax_π E_{(o,a,r,o') ~ D}[ Σ_{t=0}^{N} γ^t · r_t ]    (3)
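For concreteness, such a historical dataset D of {o, a, r, o'} transitions can be sketched as a plain list of records (the field values here are made up for illustration):

from collections import namedtuple

Transition = namedtuple("Transition", ["o", "a", "r", "o_next"])

# Each record is one logged state-action-reward-next-state transition;
# offline reinforcement learning never queries the environment again.
dataset = [
    Transition(o=[12.0, 8.3], a=[0.2, 0.0], r=1.0, o_next=[11.5, 8.4]),
    Transition(o=[11.5, 8.4], a=[0.0, 0.1], r=0.5, o_next=[11.2, 8.6]),
]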
in recent years, Reinforcement Learning (Reinforcement Learning) has been used as a decision-making method, and has many successful cases in many fields, revealing the potential of application of Reinforcement Learning. However, reinforcement learning requires defining a proper reward function, and the training effect can be seriously affected by slight carelessness when designing the reward function. Meanwhile, reinforcement learning requires that an intelligent agent continuously interacts with the environment, which is high in cost in application scenarios such as automatic driving, industrial optimization, medical diagnosis and treatment.
Deep learning, a data-driven algorithm, has been deployed at scale in important fields such as face recognition, natural language processing, object detection, and defect detection. The success of deep learning stems largely from the emergence of large-scale datasets, such as the large visual database ImageNet, and from its data-driven paradigm. In practice, large amounts of historical decision data produced by different people, machines, or algorithms can be collected, so devising data-driven decision-making methods is particularly important for transferring the successful experience of deep learning to decision making.
Since 2019, offline reinforcement-learning methods have gradually drawn wide attention. As a data-driven decision algorithm, offline reinforcement learning no longer requires online interaction with the environment and has seen applications in fields such as automatic driving, robot-arm control, and game playing. However, like reinforcement learning, offline reinforcement learning still requires a well-defined reward function, and a careless design can seriously degrade the training effect.
Imitation learning, as a strategy-learning method based on a supervised learning framework, does not require manually designing a complex reward function; it is easier to implement and is the most convenient data-driven decision method. For example, in an automatic-driving scenario, historical driving data from human experts can be collected, and an agent can be trained by imitation learning to continuously mimic human driving behavior, thereby learning to drive. Beyond automatic driving, imitation learning has been applied in robotics to accomplish manipulation tasks, and it has also made major breakthroughs in fields such as game playing.
However, conventional imitation-learning methods need to collect a large amount of historical data produced by expert strategies (expert samples for short), and expert samples are rare. Taking humans as an example, human experts make up a small share of the population, and hiring them to generate expert samples is expensive. Moreover, human experts occasionally make erroneous decisions, which further reduces the amount of expert data. Training imitation learning solely on collected expert samples would therefore undoubtedly raise the cost of deploying it in practice. In fields such as automatic driving, medical diagnosis, new-drug development, industrial control, robotics, logistics scheduling, and factory maintenance, the historical decision data that is readily available usually consists of a small number of expert samples and a large number of non-expert samples.
Therefore, to meet economic constraints, existing imitation-learning methods typically train on a small number of expert samples together with a large number of non-expert samples. In such datasets dominated by non-expert samples, however, existing imitation-learning methods are swayed by the large number of non-expert samples, and the resulting imitation strategy performs poorly.
To improve the training effect of imitation learning on training samples dominated by non-expert samples, existing technical schemes generally raise the weight of expert samples in the data through manual scoring, so that imitation learning leans toward imitating the expert samples, or raise the weight of well-performing samples according to scores obtained by letting the dataset interact with the environment. Alternatively, an inverse-reinforcement-learning method is used to obtain a reward function, and the sample weights are adjusted automatically through that reward function so as to guide imitation learning to preferentially imitate the small number of expert samples in the dataset.
However, the offline reinforcement-learning route requires defining a reasonable reward function, and a small mistake affects the training effect. The manual-annotation route requires hiring annotators with relevant knowledge to label and score the dataset, which is time-consuming, labor-intensive, and costly. Scoring samples by interacting with the environment likewise consumes time and effort, is expensive, and is less safe, especially in real-world scenarios such as automatic driving, medical diagnosis, and industrial optimization. For example, in an automatic-driving task, evaluating the quality of a dataset strategy requires running a vehicle according to the strategy in the training samples, which undoubtedly introduces significant safety hazards. The inverse-reinforcement-learning route adds an inverse-reinforcement-learning task on top of the existing imitation-learning task, so an inverse-reinforcement-learning problem must be solved in every training cycle, markedly increasing computational complexity.
The inventor provides an automatic driving training method, an automatic driving training device, automatic driving training equipment, a storage medium and a program product based on analysis of the prior art and aiming at the problems of poor safety and low efficiency when an automatic driving vehicle is trained in the prior art. The automatic driving training method of the present invention is described below with reference to fig. 1 to 3.
In one embodiment, as shown in fig. 1, the automatic driving training method implements the following steps:
Step 101, acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample;
in this embodiment, when training an autonomous vehicle, a training sample set is required, where the training sample set includes at least one expert sample and at least one non-expert sample. When automatic driving training is carried out, the expert sample refers to historical data of driving of a driver, which is rich in experience and has an accident rate lower than a preset accident threshold value; the non-expert sample refers to historical data of driving of a driver with inexperience and accident rate higher than a preset accident threshold value, and can also be historical data of driving of an automatic driving system needing manual taking over. As shown in fig. 2, a small number of expert samples and a large number of non-expert samples are used in automatic driving training of a vehicle. The vehicle is simulated and trained through the training sample set, so that the vehicle can learn sample driving actions such as lane keeping, lane merging, lane changing, overtaking, motion planning decision making and the like.
In this embodiment, the training sample set contains at least two training samples, and each training sample is either an expert sample or a non-expert sample. A training sample comprises a sample driving action and a corresponding sample driving state. The sample driving action, denoted a, is the operation performed to drive the vehicle in the sample, including actions such as accelerating, steering, and braking; it can be collected by sensors and similar devices. The sample driving state, denoted o, is the state of the vehicle under the sample driving action, for example the distance to the vehicle ahead; it can be obtained from vehicle images captured by a driving recorder, an intelligent camera, or the like.
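Such a training sample can be sketched as a simple record (the field names are illustrative; the invention does not prescribe a particular data layout):

from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    action: List[float]  # sample driving action a, e.g. [throttle, steering, braking]
    state: List[float]   # sample driving state o, e.g. [distance_to_lead_vehicle, speed]
    is_expert: bool      # True for an expert sample, False for a non-expert sample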
Step 102, determining the expert-like samples among the non-expert samples through a preset scorer, and raising the training weight of the expert-like samples in a driving simulation strategy, wherein the scorer is used to score each training sample in the training sample set, and the training results of the expert-like samples and of the expert samples are within a preset error;
in this embodiment, the scorer is a preset scoring model, and after the data of the training sample is input into the scorer, the scorer can score the training sample, and when the training sample is input into the scorer, the higher the score output by the scorer is, the closer the training sample is to the expert sample is indicated; the lower the score output by the scorer, the closer the training sample is to the non-expert sample. That is, the scorer can screen the training samples in the set of training samples that are closer to the expert sample.
In this embodiment, each training sample is scored by the scorer, so that the expert-like samples in the non-expert samples can be determined, the training weight of the expert-like samples in the driving simulation strategy is increased, the expert-like samples are treated as the expert samples, which is equivalent to increasing the number of the expert samples and improving the training efficiency.
In one embodiment, the scorer may be a parameterized function approximator, implemented with an intelligent algorithm such as a neural network. The scorer can also be designed manually; for example, a scorer for automatic driving training may be designed as follows:
[Formula (4), shown in the original as an image: the composite score S(·) combines a task-completion sub-score, a safety sub-score, and a comfort sub-score built from the driving statistics defined below.]

wherein S(·) denotes the final composite automatic-driving score, S_suc(·) the score for completing the driving task, S_Safe(·) the safety score during driving, and S_Comfortable(·) the comfort score.

Scc is the success rate of the driving task; Dis is the average distance driven; Acc is the acceleration of the vehicle; TTC is the predicted time to collision with a nearby vehicle; D is the distance to a nearby vehicle; C indicates whether a collision occurred (1 for a collision, 0 for none); V is the speed of the vehicle; T is the degree of throttle opening; R is the steering-wheel angle; B is the braking force; and α1 to α10 and β1 to β3 are all adjustable (tunable) parameters. By continuously adjusting α1 to α10 and β1 to β3, the scorer S(·) can be made to give higher scores to the driving behavior of human experts and lower scores to poor driving behavior.
In this embodiment, when the scorer is first established, its adjustable parameters, such as α1 to α10 and β1 to β3 above, must be randomly initialized, which prevents the subsequent optimization process from falling into a locally optimal solution.
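A sketch of such a manually designed scorer with randomly initialized adjustable parameters follows; the grouping of the ten statistics into the three sub-scores is an assumption made for illustration, since the original formula fixes only which quantities and parameters appear:

import numpy as np

rng = np.random.default_rng(seed=42)
alpha = rng.normal(size=10)  # alpha_1..alpha_10: weights on the ten driving statistics
beta = rng.normal(size=3)    # beta_1..beta_3: weights on the three sub-scores

def composite_score(stats):
    # stats: [Scc, Dis, Acc, TTC, D, C, V, T, R, B] for one training sample.
    s_suc = alpha[0] * stats[0] + alpha[1] * stats[1]               # task completion
    s_safe = sum(alpha[i] * stats[i] for i in range(2, 6))          # safety
    s_comfortable = sum(alpha[i] * stats[i] for i in range(6, 10))  # comfort
    return beta[0] * s_suc + beta[1] * s_safe + beta[2] * s_comfortable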
In one embodiment, the scorer is learned from positive and negative examples. Specifically, α1 to α10 and β1 to β3 are adjusted according to the principle shown in the following formula (5):

[Formula (5), shown in the original as an image: a minimization over the parameters of S(·), combining an expectation over training samples (o, a) collected from the expert sample E with a λ-weighted expectation over training samples (o, a) collected from the non-expert samples O.]

where o and a are, respectively, the sample driving state and the corresponding sample driving action in each training sample; (o, a) ~ E indicates that the training sample (o, a) is collected from the expert sample E; (o, a) ~ O indicates that the training sample (o, a) is collected from the non-expert samples O; S(·) is the composite scorer of formula (4); S(o, a) outputs the degree of similarity between (o, a) and the expert samples; λ is a manually adjusted hyper-parameter; and E[·] denotes the expected value. Updating the parameters of S(·) according to the above formula means adjusting α1 to α10 and β1 to β3; formula (5) adjusts the parameters in S(·) with the objective of minimizing the expression to the right of min.
After adjustment, the scorer outputs a high value when it encounters data from the expert samples and a low value for data from the non-expert samples, thereby distinguishing the training samples. However, the non-expert samples contain plenty of data that resembles the expert samples, and uniformly treating all data in the non-expert samples as poorly performing is clearly unreasonable. A new scorer is therefore needed to mine the better-performing data within the non-expert samples.
In one embodiment, in order to mine the better-performing data in the non-expert samples, i.e., the expert-like data, the distribution function of the driving simulation strategy is used as part of the scorer's input; that is, the scorer contains the distribution function of the driving simulation strategy. Specifically, the expert-like samples among the non-expert samples are determined by the preset scorer as follows: input each training sample into the scorer; adjust the scorer's adjustable parameters according to the conditional probability distribution values that the distribution function generates for each training sample; and determine the expert-like samples among the non-expert samples with the adjusted scorer.
In one specific embodiment, the scorer is learned, i.e., α1 to α10 and β1 to β3 are adjusted, according to the principle of the following formula (6):

[Formula (6), shown in the original as an image: a minimization over the parameters of S(·) in the style of formula (5), with the scorer additionally taking the strategy's conditional probability as input, i.e., S(o, a, log μ(a|o)).]

where μ denotes the driving simulation strategy and μ(a|o) is a conditional probability distribution, namely the conditional probability distribution function of the sample driving action a given the driving state o, i.e., the distribution function of the driving simulation strategy. If the driving simulation strategy μ is very similar to the expert strategy (i.e., an assumed accident-free driving strategy), then μ(a|o) outputs a value approaching 1 when the training sample resembles an expert sample, and a value approaching 0 otherwise. The parameters of S(·) are updated according to this formula; that is, formula (6) adjusts the parameters in S(·) with the objective of minimizing the expression to the right of min, so that the scorer outputs a high value when it sees a sample similar to the expert samples and a low value when it sees data that differs greatly from them, allowing the scorer to readily mine the expert-like data in the non-expert samples.
In one embodiment, after the adjustable parameters in the scorer are adjusted based on the above embodiment, the adjusted scorer is used to determine the expert-like sample in the non-expert sample, and the specific process is as follows: inputting each training sample into the adjusted scorer respectively to obtain a scoring result corresponding to each training sample; and determining the non-expert sample with the scoring result larger than the preset scoring threshold value as an expert-like sample.
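With the adjusted scorer, the screening itself reduces to a threshold test (a sketch; scorer and score_threshold stand for the adjusted scorer and the preset scoring threshold):

def select_expert_like(non_expert_samples, scorer, score_threshold):
    # Keep every non-expert sample whose scoring result exceeds the preset threshold.
    return [s for s in non_expert_samples if scorer(s) > score_threshold]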
In one embodiment, after the adjustable parameters in the scorer have been adjusted as in the above embodiments, the adjusted scorer is incorporated into the driving simulation strategy; that is, the driving simulation strategy includes the adjusted scorer. The training weight of the expert-like samples in the driving simulation strategy is then raised as follows: according to the scoring result corresponding to each training sample, increase the training weight of the expert-like samples in the driving simulation strategy, where the larger the scoring result for a training sample, the larger its training weight.
In one specific embodiment, formula (6) can be viewed as a functional of the driving simulation strategy μ, which, from the viewpoint of functional analysis, can serve as the optimization variable. Taking the first derivative with respect to μ according to formula (6) and combining it with the optimization objective of formula (4) yields the optimization objective shown in formula (7):

[Formula (7), shown in the original as an image: a minimization over the parameters of the driving simulation strategy μ of a sample-weighted imitation objective, in which the weight of each training sample grows with its score S.]

where S abbreviates S(o, a, log μ(a|o)) and δ is a manually adjusted hyper-parameter. Formula (7) adjusts the parameters in the driving simulation strategy μ with the objective of minimizing the expression to the right of min. Because formula (7) is transformed from formula (6), optimizing the objective of formula (7), namely the driving simulation strategy μ, is equivalent to imitating the expert samples and expert-like samples with preference: the sample weight automatically increases the weight of expert-like data within the non-expert samples and automatically decreases the weight of data that differs greatly from the expert samples, which makes it convenient to mine the expert-like data in the non-expert samples.
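An illustrative weighted-imitation update in the spirit of formula (7), assuming an exponential weight exp(S/δ); the exact weighting of the original formula is not asserted here, but any weight that grows with the score S yields the described preference for expert and expert-like samples:

import torch

def weighted_imitation_loss(log_mu, scores, delta=1.0):
    # log_mu: log mu(a|o) for each training sample under the current strategy.
    # scores: S(o, a, log mu(a|o)) from the adjusted scorer, detached so that
    #         only the imitation term is differentiated.
    weights = torch.exp(scores.detach() / delta)
    return -(weights * log_mu).mean()  # higher-scoring samples are imitated more strongly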
Step 103, performing simulation training on the autonomous vehicle with the training samples according to the weight-adjusted driving simulation strategy.
In this embodiment, after the training weights of the expert-like samples in the driving simulation strategy are increased, simulation training is performed on the automatically driven vehicle according to the driving simulation strategy with the adjusted weights through the training samples.
In one embodiment, each training sample includes a corresponding sample driving action and sample driving state. The simulation training of the autonomous vehicle according to the weight-adjusted driving simulation strategy then proceeds as follows for each training sample: input the sample driving action into the weight-adjusted driving simulation strategy; obtain the actual driving state of the autonomous vehicle after it drives automatically according to the driving simulation strategy; and compare the actual driving state with the sample driving state. When the comparison result does not satisfy the preset state error, adjust the internal parameters of the driving simulation strategy until it does.
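This per-sample loop can be sketched as follows (simulate, state_error, and update_parameters are assumed names standing for the simulator, the preset state-error check, and the parameter update described above):

def simulation_training(strategy, training_samples, max_state_error):
    for sample in training_samples:
        while True:
            # Feed the sample driving action into the weight-adjusted strategy and
            # read back the actual driving state after automatic driving.
            actual_state = strategy.simulate(sample.action)
            if state_error(actual_state, sample.state) <= max_state_error:
                break  # the comparison result meets the preset state error
            strategy.update_parameters(actual_state, sample.state)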
In one embodiment, a phased training method may be used when training an autonomous vehicle. Specifically, after the automatic driving vehicle is subjected to simulation training through a training sample according to the driving simulation strategy after the weight is adjusted, the training times of the automatic driving vehicle are obtained; when the training times are determined to be smaller than the times threshold value, acquiring a new training sample set; adjusting the driving simulation strategy again through a new training sample set; and performing simulation training again on the automatic driving vehicle according to the driving simulation strategy after the adjustment again through the new training sample set.
In one specific example, as shown in FIG. 3, the entire training process for an autonomous vehicle is as follows:
step 301, initializing the parameters of the driving simulation strategy and of the scorer;
step 302, extracting a training sample set from the sample database;
step 303, learning and updating the scorer S(·), which contains the distribution function of the driving simulation strategy;
step 304, updating the driving simulation strategy μ using the updated scorer S(·);
step 305, judging whether the training times are smaller than a time threshold, if so, executing step 302, and if not, executing step 306;
step 306, ending the training.
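The whole flow can be condensed into a short loop (a sketch under the assumption that sample_database, scorer, and strategy expose the operations named in steps 301 to 305):

def train_autonomous_driving(sample_database, scorer, strategy, count_threshold):
    scorer.initialize_parameters()                            # step 301
    strategy.initialize_parameters()                          # step 301
    for training_count in range(count_threshold):             # step 305
        batch = sample_database.draw_training_set()           # step 302
        scorer.update(batch, strategy.distribution_function)  # step 303
        strategy.update(batch, scorer)                        # step 304
    # step 306: training ends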
The automatic driving training method provided by the invention acquires a training sample set required for automatic driving training, the set containing at least one expert sample and at least one non-expert sample. Expert-like samples among the non-expert samples are then determined by a preset scorer, and the training weight of the expert-like samples in the driving simulation strategy is raised, the training results of the expert-like samples lying within a preset error of the training results of the expert samples. Finally, the autonomous vehicle is given simulation training with the training samples according to the weight-adjusted driving simulation strategy. In other words, for the expert-like samples in the non-expert samples, i.e., those whose training effect is good, the above process raises their training weight; because the training results of the expert-like samples and the expert samples lie within the preset error, this is equivalent to increasing the number of expert samples, which improves safety during training of the autonomous vehicle and raises training efficiency at the same time. In addition, the process identifies and treats the expert-like samples within the non-expert samples separately, so the non-expert samples are used more fully than when all non-expert samples share the same training strategy, further improving the training effect of the autonomous vehicle.
In the invention, the score output by the scorer is used for automatically adjusting the weight of the expert-like sample in the simulation learning process. If the scorer shows that the driving simulation strategy is similar to the expert strategy, the weight of the driving simulation strategy is increased, and the driving simulation strategy is encouraged to better simulate the expert strategy. Conversely, if the scorer indicates that the driving simulation strategy differs significantly from the expert strategy, its weight is reduced, preventing the driving simulation strategy from being updated in a direction away from the expert strategy.
In addition, the invention adopts a supervised learning framework and introduces only one scorer for judging how similar a sample is to the expert samples; beyond the original imitation learning, only the supervised learning of the scorer is added, no reward function needs to be defined, and the implementation difficulty and computational complexity are low. Meanwhile, the scorer is modified from existing scorers by taking the distribution function of the driving simulation strategy as an additional input signal; compared with existing scorers, it can evaluate how similar a training sample is to the expert samples, so the expert-like data in the non-expert samples can be mined, reducing the dependence of imitation learning on the number of expert samples. Furthermore, the method requires no additional human labeling of training samples and no additional environment interaction, so its cost is low and its safety high, further improving safety during automatic driving and raising training efficiency. If the method is extended to other application scenarios where expert samples are scarce and expensive, such as medical diagnosis and treatment, industrial optimization, robotics, and logistics scheduling, it can strongly support those applications and greatly reduce deployment cost, giving it considerable economic potential.
The following describes the automatic driving training device provided by the present invention; the automatic driving training device described below and the automatic driving training method described above may be referred to correspondingly, and repeated details are not described again. As shown in fig. 4, the automatic driving training device includes:
an obtaining module 401, configured to obtain a training sample set required by automatic driving training, where the training sample set includes at least one expert sample and at least one non-expert sample;
an adjusting module 402, configured to determine, through a preset scorer, a class expert sample in each non-expert sample, and improve a training weight of the class expert sample in the driving simulation strategy, where the scorer is configured to grade each training sample in a training sample set, and a training result of the class expert sample and a training result of the expert sample are within a preset error;
and a training module 403, configured to perform simulation training on the autonomous vehicle according to the driving simulation strategy with the adjusted weights through the training samples.
In one embodiment, the adjusting module 402 is specifically configured to input each training sample into the scorer; adjusting adjustable parameters of the scorer according to the conditional probability distribution values generated by the distribution function and respectively corresponding to each training sample; and determining an expert-like sample in the non-expert sample through the adjusted scorer, wherein the scorer comprises a distribution function of the driving simulation strategy.
In one embodiment, the adjusting module 402 is specifically configured to input each training sample to the adjusted scorer, and obtain a scoring result corresponding to each training sample; and determining the non-expert sample with the scoring result larger than the preset scoring threshold value as an expert-like sample.
The adjusting module 402 is specifically configured to increase the training weight of the expert-like sample in the driving simulation strategy according to the scoring result corresponding to each training sample, where the greater the scoring result corresponding to the training sample is, the greater the training weight corresponding to the training sample is, and the driving simulation strategy includes an adjusted scorer.
In an embodiment, the obtaining module 401 is specifically configured to perform the following processing on each training sample: inputting the sample driving action into the driving simulation strategy after the weight is adjusted; acquiring an actual driving state of an automatically driven vehicle after the vehicle automatically drives according to a driving simulation strategy; and comparing the actual running state with the sample running state, and adjusting the internal parameters of the driving simulation strategy when the comparison result does not accord with the preset state error until the comparison result accords with the preset state error.
In one embodiment, the adjusting module 402 is specifically configured to obtain training times of the autonomous vehicle after performing simulation training on the autonomous vehicle according to the driving simulation strategy with the adjusted weight by using a training sample; when the training times are determined to be smaller than the times threshold value, acquiring a new training sample set; adjusting the driving simulation strategy again through a new training sample set; and performing simulation training again on the automatic driving vehicle according to the driving simulation strategy after the adjustment again through the new training sample set.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)501, a communication Interface (Communications Interface)502, a memory (memory)503, and a communication bus 504, wherein the processor 501, the communication Interface 502, and the memory 503 are configured to communicate with each other via the communication bus 504. Processor 501 may call logic instructions in memory 503 to perform an automated driving training method comprising: acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining a class expert sample in each non-expert sample through a preset scorer, and improving the training weight of the class expert sample in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and the training result of the class expert sample and the training result of the expert sample are within a preset error; and performing simulation training on the automatic driving vehicle through the training sample according to the driving simulation strategy after the weight is adjusted.
In addition, the logic instructions in the memory 503 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the automatic driving training method provided by the above methods, the method comprising: acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining a class expert sample in each non-expert sample through a preset scorer, and improving the training weight of the class expert sample in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and the training result of the class expert sample and the training result of the expert sample are within a preset error; and performing simulation training on the automatic driving vehicle through the training sample according to the driving simulation strategy after the weight is adjusted.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an automated driving training method provided by the above methods, the method comprising: acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining a class expert sample in each non-expert sample through a preset scorer, and improving the training weight of the class expert sample in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and the training result of the class expert sample and the training result of the expert sample are within a preset error; and performing simulation training on the automatic driving vehicle through the training sample according to the driving simulation strategy after the weight is adjusted.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An automated driving training method, comprising:
acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample;
determining expert-like samples in the non-expert samples through a preset scorer, and improving training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and training results of the expert-like samples and training results of the expert samples are within a preset error;
and performing simulation training on the automatic driving vehicle according to the driving simulation strategy after the weight is adjusted through the training sample.
2. The automated driving training method of claim 1, wherein the scorer comprises a distribution function of the driving simulation strategy;
the determining, by a preset scorer, the expert-like samples in the non-expert samples includes:
inputting each of the training samples into the scorer;
adjusting the adjustable parameters of the scorer according to the conditional probability distribution values generated by the distribution function and respectively corresponding to each training sample;
determining the expert-like samples in the non-expert samples through the adjusted scorer.
3. The automated driver training method of claim 2, wherein the determining the expert-like sample of the non-expert samples by the adjusted scorer comprises:
inputting each training sample into the adjusted scorer respectively to obtain a scoring result corresponding to each training sample;
and determining the non-expert sample with the scoring result larger than a preset scoring threshold value as the class expert sample.
4. The automated driving training method of claim 3, wherein the driving simulation strategy includes the adjusted scorer;
the method for improving the training weight of the expert-like sample in the driving simulation strategy comprises the following steps:
and according to the scoring result corresponding to each training sample, improving the training weight of the expert-like sample in the driving simulation strategy, wherein the larger the scoring result corresponding to the training sample is, the larger the training weight corresponding to the training sample is.
5. The automated driving training method according to claim 1, wherein after performing simulation training on the automated driving vehicle according to the driving simulation strategy with the adjusted weight by using the training samples, the method further comprises:
acquiring training times of the automatic driving vehicle;
when the training times are determined to be smaller than a time threshold value, acquiring a new training sample set;
adjusting the driving simulation strategy again through the new training sample set;
and performing simulation training on the automatic driving vehicle again according to the driving simulation strategy after the adjustment again through a new training sample set.
6. The automated driving training method of claim 1, wherein each of the training samples comprises a corresponding sample driving action and a sample driving state;
the simulation training of the automatic driving vehicle is carried out according to the driving simulation strategy after the weight is adjusted through the training sample, and the simulation training comprises the following steps:
performing the following processing on each training sample:
inputting the sample driving action to the weighted driving simulation strategy;
acquiring an actual driving state of the automatic driving vehicle after automatic driving according to the driving simulation strategy;
and comparing the actual running state with the sample running state, and when the comparison result does not accord with a preset state error, adjusting the internal parameters of the driving simulation strategy until the comparison result accords with the preset state error.
7. An automated driving training apparatus, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a training sample set required by automatic driving training, and the training sample set comprises at least one expert sample and at least one non-expert sample;
the adjusting module is used for determining expert-like samples in the non-expert samples through a preset scorer and improving training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and training results of the expert-like samples and training results of the expert samples are within a preset error;
and the training module is used for performing simulation training on the automatic driving vehicle according to the driving simulation strategy after the weight is adjusted through the training sample.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the automated driving training method according to any of claims 1 to 6 are implemented when the program is executed by the processor.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the automated driving training method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the automatic drive training method according to any one of claims 1 to 6 when executed by a processor.
CN202111437745.0A 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product Pending CN114372501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111437745.0A CN114372501A (en) 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111437745.0A CN114372501A (en) 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN114372501A true CN114372501A (en) 2022-04-19

Family

ID=81139520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111437745.0A Pending CN114372501A (en) 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN114372501A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114986512A (en) * 2022-06-27 2022-09-02 清华大学 Dual preference simulation learning method and system supported by dynamic model
CN115099037A (en) * 2022-06-27 2022-09-23 清华大学 Preference simulation learning method and preference simulation learning system supported by dynamic model

Similar Documents

Publication Publication Date Title
US11062617B2 (en) Training system for autonomous driving control policy
CN110949398B (en) Method for detecting abnormal driving behavior of first-vehicle drivers in vehicle formation driving
CN108227710B (en) Automatic driving control method and apparatus, electronic device, program, and medium
Zhang et al. Query-efficient imitation learning for end-to-end autonomous driving
CN114358128B (en) Method for training end-to-end automatic driving strategy
CN111231983B (en) Vehicle control method, device and equipment based on traffic accident memory network
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
US12005922B2 (en) Toward simulation of driver behavior in driving automation
EP4022514B1 (en) Multi-agent simulations
CN112068549B (en) Unmanned system cluster control method based on deep reinforcement learning
CN111483468A (en) Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning
CN112508164B (en) End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN110119714B (en) Driver fatigue detection method and device based on convolutional neural network
CN115053237A (en) Vehicle intent prediction neural network
CN111260027A (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN114372501A (en) Automatic driving training method, device, equipment, storage medium and program product
EP4216098A1 (en) Methods and apparatuses for constructing vehicle dynamics model and for predicting vehicle state information
Babiker et al. Convolutional neural network for a self-driving car in a virtual environment
CN117610681A (en) Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning
CN117719535A (en) Human feedback automatic driving vehicle interactive self-adaptive decision control method
DE102019203634A1 (en) Method and device for controlling a robot
CN114148349B (en) Vehicle personalized following control method based on generation of countermeasure imitation study
CN116300944A (en) Automatic driving decision method and system based on improved Double DQN
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
CN112560354B (en) Car following behavior modeling method based on Gaussian process regression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination