CN114372501A - Automatic driving training method, device, equipment, storage medium and program product - Google Patents


Info

Publication number
CN114372501A
CN114372501A
Authority
CN
China
Prior art keywords
training
sample
expert
driving
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111437745.0A
Other languages
Chinese (zh)
Inventor
詹仙园
李键雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111437745.0A priority Critical patent/CN114372501A/en
Publication of CN114372501A publication Critical patent/CN114372501A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B9/00 Simulators for teaching or training purposes
    • G09B9/02 Simulators for teaching or training purposes for teaching control of vehicles or other craft
    • G09B9/04 Simulators for teaching or training purposes for teaching control of vehicles or other craft for teaching control of land vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Feedback Control In General (AREA)

Abstract

The present invention relates to the field of automatic driving technologies, and in particular to an automatic driving training method, apparatus, device, storage medium, and program product. The method comprises the following steps: acquiring a training sample set required for automatic driving training, the training sample set comprising at least one expert sample and at least one non-expert sample; determining expert-like samples among the non-expert samples by means of a preset scorer and raising the training weight of the expert-like samples in a driving simulation strategy, wherein the scorer is used to score each training sample in the training sample set and the training results of the expert-like samples lie within a preset error of the training results of the expert samples; and performing simulation training on the autonomous vehicle with the training samples according to the weight-adjusted driving simulation strategy. The method addresses the poor safety and low efficiency of prior-art training of autonomous vehicles and achieves safe, efficient training of the autonomous vehicle.

Description

Automatic driving training method, device, equipment, storage medium and program product
Technical Field
The present invention relates to the field of automatic driving technologies, and in particular, to an automatic driving training method, apparatus, device, storage medium, and program product.
Background
With the development of science and technology, automatic driving has gradually come into public view. To ensure the safety of automatic driving, the training process of the autonomous vehicle is very important. In the prior art, training an autonomous vehicle requires a large number of training samples, including historical driving data from experienced drivers with low accident rates, historical driving data from inexperienced drivers with high accident rates, and historical driving data from automatic driving systems that required manual takeover. However, because the driving data of inexperienced, accident-prone drivers accounts for a large share of the training samples, using these samples for simulation training creates serious safety hazards during training and a high probability of driving accidents; at the same time, the training time must be extended to secure the final training effect, so the training efficiency of automatic driving is low.
Disclosure of Invention
The invention provides an automatic driving training method, an automatic driving training device, automatic driving training equipment, a storage medium and a program product, which are used for solving the problems of poor safety and low efficiency in the prior art when an automatic driving vehicle is trained and realizing the safe and efficient training of the automatic driving vehicle.
The invention provides an automatic driving training method, which comprises the following steps: acquiring a training sample set required for automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining expert-like samples among the non-expert samples through a preset scorer, and raising the training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used to score each training sample in the training sample set, and the training results of the expert-like samples and of the expert samples are within a preset error; and performing simulation training on the autonomous vehicle with the training samples according to the weight-adjusted driving simulation strategy.
According to the automatic driving training method provided by the invention, the scorer comprises a distribution function of the driving simulation strategy; the determining, by a preset scorer, the expert-like samples in the non-expert samples includes: inputting each of the training samples into the scorer; adjusting the adjustable parameters of the scorer according to the conditional probability distribution values generated by the distribution function and respectively corresponding to each training sample; determining the expert-like samples in the non-expert samples through the adjusted scorer.
According to the automatic driving training method provided by the invention, the determining the expert-like sample in the non-expert sample through the adjusted scorer comprises the following steps: inputting each training sample into the adjusted scorer respectively to obtain a scoring result corresponding to each training sample; and determining the non-expert sample with the scoring result larger than a preset scoring threshold value as the class expert sample.
According to the automatic driving training method provided by the invention, the driving simulation strategy comprises the adjusted scorer; the method for improving the training weight of the expert-like sample in the driving simulation strategy comprises the following steps: and according to the scoring result corresponding to each training sample, improving the training weight of the expert-like sample in the driving simulation strategy, wherein the larger the scoring result corresponding to the training sample is, the larger the training weight corresponding to the training sample is.
According to the automatic driving training method provided by the present invention, after performing the simulation training on the automatic driving vehicle according to the driving simulation strategy with the adjusted weight by using the training sample, the method further includes: acquiring training times of the automatic driving vehicle; when the training times are determined to be smaller than a time threshold value, acquiring a new training sample set; adjusting the driving simulation strategy again through the new training sample set; and performing simulation training on the automatic driving vehicle again according to the driving simulation strategy after the adjustment again through a new training sample set.
According to the automatic driving training method provided by the invention, each training sample comprises a corresponding sample driving action and a sample driving state; the simulation training of the automatic driving vehicle is carried out according to the driving simulation strategy after the weight is adjusted through the training sample, and the simulation training comprises the following steps: performing the following processing on each training sample: inputting the sample driving action to the weighted driving simulation strategy; acquiring an actual driving state of the automatic driving vehicle after automatic driving according to the driving simulation strategy; and comparing the actual running state with the sample running state, and when the comparison result does not accord with a preset state error, adjusting the internal parameters of the driving simulation strategy until the comparison result accords with the preset state error.
The present invention also provides an automatic driving training apparatus, comprising: the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a training sample set required by automatic driving training, and the training sample set comprises at least one expert sample and at least one non-expert sample; the adjusting module is used for determining expert-like samples in the non-expert samples through a preset scorer and improving training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and training results of the expert-like samples and training results of the expert samples are within a preset error; and the training module is used for performing simulation training on the automatic driving vehicle according to the driving simulation strategy after the weight is adjusted through the training sample.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the steps of any of the above described automatic driving training methods when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the automated driving training method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the automatic driving training method as described in any one of the above.
The automatic driving training method, apparatus, device, storage medium, and program product provided by the invention acquire a training sample set required for automatic driving training, the set containing at least one expert sample and at least one non-expert sample. Expert-like samples among the non-expert samples are then determined by a preset scorer, and the training weight of the expert-like samples in the driving simulation strategy is raised, the training results of the expert-like samples lying within a preset error of the training results of the expert samples. Finally, the autonomous vehicle is given simulation training with the training samples according to the weight-adjusted driving simulation strategy. In other words, for the expert-like samples in the non-expert samples, i.e., those whose training effect is good, the above process raises their training weight; because the training results of the expert-like samples and the expert samples lie within the preset error, this is equivalent to increasing the number of expert samples, which improves safety during training of the autonomous vehicle and raises training efficiency at the same time. In addition, the process identifies and treats the expert-like samples within the non-expert samples separately, so the non-expert samples are used more fully than when all non-expert samples share the same training strategy, further improving the training effect of the autonomous vehicle.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow diagram of an automated driving training method provided by the present invention;
FIG. 2 is a diagram illustrating an exemplary training sample set provided by the present invention;
FIG. 3 is a second schematic flow chart of the automatic driving training method provided by the present invention;
FIG. 4 is a schematic view of the structural connection of the automatic driving training device provided by the present invention;
FIG. 5 is a schematic structural connection diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The inventors analyzed the prior art. In the prior art, when training an automatically driven vehicle, a suitable training and learning method must be selected according to the actual situation. The main learning methods in the prior art include imitation learning, reinforcement learning, inverse reinforcement learning, and offline reinforcement learning.
Imitation Learning is a method of learning expert strategies within a supervised learning framework. The basic idea is to record the state o_t observed by an expert and the action a_t the expert subsequently takes, and then to fit, with a neural network or the like, a mapping from the state o_t to the action a_t. When the agent later observes the state o_t, it can then produce an action similar to a_t; this imitation of the expert strategy is called imitation learning.
Reinforcement Learning, one of the research hotspots of machine learning, requires no labeled input-output samples and instead tends to learn the strategy that is optimal for completing the task. The basic idea is that an agent receives reward stimuli r while continually interacting with the environment and, step by step, forms a prediction Q of the cumulative reward; based on the current state, it can then take the action a* that attains the highest value:

a* = argmax_a Q    (1)

where argmax_a selects the action that maximizes Q. The agent thereby learns the strategy π* that obtains the maximum reward:

π* = argmax_π E[ Σ_{t=0}^{N} γ^t · r_t ]    (2)

where t denotes time, π denotes a strategy, γ is the discount factor, r_t is the reward at time t, and N is a value greater than 0.
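For intuition, a minimal sketch of these two quantities, assuming a simple tabular Q function:

import numpy as np

def greedy_action(q_values):
    # Formula (1): a* is the action with the highest predicted cumulative reward Q.
    return int(np.argmax(q_values))

def discounted_return(rewards, gamma=0.9):
    # Formula (2): the discounted sum of rewards, gamma^t * r_t for t = 0..N.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(greedy_action(np.array([0.1, 0.7, 0.3])))  # -> 1
print(discounted_return([1.0, 0.0, 1.0]))        # -> 1.0 + 0.81 = 1.81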
Inverse reinforcement learning is a branch of reinforcement learning in which a reward function r_t = R(o_t, a_t) is learned from expert demonstrations; the learned reward function then guides reinforcement-learning training, which alleviates, to a certain extent, the difficulty of designing a reward function for reinforcement learning.
Offline Reinforcement Learning and Online Reinforcement Learning are the two major branches of reinforcement learning, and both are data-driven decision methods. In contrast to online reinforcement learning, offline reinforcement learning requires no online interaction between the agent and the environment at all; instead, the optimal strategy is learned from a historical dataset D that records the agent's state-action-reward-next-state transitions {o, a, r, o'}, so that the strategy obtains the maximum cumulative reward. The strategy π* that obtains the maximum reward is then:

π* = argmax_π E_{(o,a,r,o') ~ D}[ Σ_{t=0}^{N} γ^t · r_t ]    (3)
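For concreteness, such a historical dataset D of {o, a, r, o'} transitions can be sketched as a plain list of records (the field values here are made up for illustration):

from collections import namedtuple

Transition = namedtuple("Transition", ["o", "a", "r", "o_next"])

# Each record is one logged state-action-reward-next-state transition;
# offline reinforcement learning never queries the environment again.
dataset = [
    Transition(o=[12.0, 8.3], a=[0.2, 0.0], r=1.0, o_next=[11.5, 8.4]),
    Transition(o=[11.5, 8.4], a=[0.0, 0.1], r=0.5, o_next=[11.2, 8.6]),
]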
in recent years, Reinforcement Learning (Reinforcement Learning) has been used as a decision-making method, and has many successful cases in many fields, revealing the potential of application of Reinforcement Learning. However, reinforcement learning requires defining a proper reward function, and the training effect can be seriously affected by slight carelessness when designing the reward function. Meanwhile, reinforcement learning requires that an intelligent agent continuously interacts with the environment, which is high in cost in application scenarios such as automatic driving, industrial optimization, medical diagnosis and treatment.
Deep learning, a data-driven algorithm, has been deployed at scale in important fields such as face recognition, natural language processing, object detection, and defect detection. The success of deep learning stems largely from the emergence of large-scale datasets, such as the large visual database ImageNet, and from its data-driven paradigm. In practice, large amounts of historical decision data produced by different people, machines, or algorithms can be collected, so devising data-driven decision-making methods is particularly important for transferring the successful experience of deep learning to decision making.
Since 2019, offline reinforcement-learning methods have gradually drawn wide attention. As a data-driven decision algorithm, offline reinforcement learning no longer requires online interaction with the environment and has seen applications in fields such as automatic driving, robot-arm control, and game playing. However, like reinforcement learning, offline reinforcement learning still requires a well-defined reward function, and a careless design can seriously degrade the training effect.
Imitation learning, as a strategy-learning method based on a supervised learning framework, does not require manually designing a complex reward function; it is easier to implement and is the most convenient data-driven decision method. For example, in an automatic-driving scenario, historical driving data from human experts can be collected, and an agent can be trained by imitation learning to continuously mimic human driving behavior, thereby learning to drive. Beyond automatic driving, imitation learning has been applied in robotics to accomplish manipulation tasks, and it has also made major breakthroughs in fields such as game playing.
However, conventional imitation-learning methods need to collect a large amount of historical data produced by expert strategies (expert samples for short), and expert samples are rare. Taking humans as an example, human experts make up a small share of the population, and hiring them to generate expert samples is expensive. Moreover, human experts occasionally make erroneous decisions, which further reduces the amount of expert data. Training imitation learning solely on collected expert samples would therefore undoubtedly raise the cost of deploying it in practice. In fields such as automatic driving, medical diagnosis, new-drug development, industrial control, robotics, logistics scheduling, and factory maintenance, the historical decision data that is readily available usually consists of a small number of expert samples and a large number of non-expert samples.
Therefore, to meet economic constraints, existing imitation-learning methods typically train on a small number of expert samples together with a large number of non-expert samples. In such datasets dominated by non-expert samples, however, existing imitation-learning methods are swayed by the large number of non-expert samples, and the resulting imitation strategy performs poorly.
To improve the training effect of imitation learning on training samples dominated by non-expert samples, existing technical schemes generally raise the weight of expert samples in the data through manual scoring, so that imitation learning leans toward imitating the expert samples, or raise the weight of well-performing samples according to scores obtained by letting the dataset interact with the environment. Alternatively, an inverse-reinforcement-learning method is used to obtain a reward function, and the sample weights are adjusted automatically through that reward function so as to guide imitation learning to preferentially imitate the small number of expert samples in the dataset.
However, the offline reinforcement-learning route requires defining a reasonable reward function, and a small mistake affects the training effect. The manual-annotation route requires hiring annotators with relevant knowledge to label and score the dataset, which is time-consuming, labor-intensive, and costly. Scoring samples by interacting with the environment likewise consumes time and effort, is expensive, and is less safe, especially in real-world scenarios such as automatic driving, medical diagnosis, and industrial optimization. For example, in an automatic-driving task, evaluating the quality of a dataset strategy requires running a vehicle according to the strategy in the training samples, which undoubtedly introduces significant safety hazards. The inverse-reinforcement-learning route adds an inverse-reinforcement-learning task on top of the existing imitation-learning task, so an inverse-reinforcement-learning problem must be solved in every training cycle, markedly increasing computational complexity.
The inventor provides an automatic driving training method, an automatic driving training device, automatic driving training equipment, a storage medium and a program product based on analysis of the prior art and aiming at the problems of poor safety and low efficiency when an automatic driving vehicle is trained in the prior art. The automatic driving training method of the present invention is described below with reference to fig. 1 to 3.
In one embodiment, as shown in fig. 1, the automatic driving training method implements the following steps:
Step 101, acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample;
in this embodiment, when training an autonomous vehicle, a training sample set is required, where the training sample set includes at least one expert sample and at least one non-expert sample. When automatic driving training is carried out, the expert sample refers to historical data of driving of a driver, which is rich in experience and has an accident rate lower than a preset accident threshold value; the non-expert sample refers to historical data of driving of a driver with inexperience and accident rate higher than a preset accident threshold value, and can also be historical data of driving of an automatic driving system needing manual taking over. As shown in fig. 2, a small number of expert samples and a large number of non-expert samples are used in automatic driving training of a vehicle. The vehicle is simulated and trained through the training sample set, so that the vehicle can learn sample driving actions such as lane keeping, lane merging, lane changing, overtaking, motion planning decision making and the like.
In this embodiment, the training sample set contains at least two training samples, and each training sample is either an expert sample or a non-expert sample. A training sample comprises a sample driving action and a corresponding sample driving state. The sample driving action, denoted a, is the operation performed to drive the vehicle in the sample, including actions such as accelerating, steering, and braking; it can be collected by sensors and similar devices. The sample driving state, denoted o, is the state of the vehicle under the sample driving action, for example the distance to the vehicle ahead; it can be obtained from vehicle images captured by a driving recorder, an intelligent camera, or the like.
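Such a training sample can be sketched as a simple record (the field names are illustrative; the invention does not prescribe a particular data layout):

from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    action: List[float]  # sample driving action a, e.g. [throttle, steering, braking]
    state: List[float]   # sample driving state o, e.g. [distance_to_lead_vehicle, speed]
    is_expert: bool      # True for an expert sample, False for a non-expert sample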
Step 102, determining the expert-like samples among the non-expert samples through a preset scorer, and raising the training weight of the expert-like samples in a driving simulation strategy, wherein the scorer is used to score each training sample in the training sample set, and the training results of the expert-like samples and of the expert samples are within a preset error;
in this embodiment, the scorer is a preset scoring model, and after the data of the training sample is input into the scorer, the scorer can score the training sample, and when the training sample is input into the scorer, the higher the score output by the scorer is, the closer the training sample is to the expert sample is indicated; the lower the score output by the scorer, the closer the training sample is to the non-expert sample. That is, the scorer can screen the training samples in the set of training samples that are closer to the expert sample.
In this embodiment, each training sample is scored by the scorer, so that the expert-like samples in the non-expert samples can be determined, the training weight of the expert-like samples in the driving simulation strategy is increased, the expert-like samples are treated as the expert samples, which is equivalent to increasing the number of the expert samples and improving the training efficiency.
In one embodiment, the scorer may be a parameterized function approximator, implemented with an intelligent algorithm such as a neural network. The scorer can also be designed manually; for example, a scorer for automatic driving training may be designed as follows:
[Formula (4), shown in the original as an image: the composite score S(·) combines a task-completion sub-score, a safety sub-score, and a comfort sub-score built from the driving statistics defined below.]

wherein S(·) denotes the final composite automatic-driving score, S_suc(·) the score for completing the driving task, S_Safe(·) the safety score during driving, and S_Comfortable(·) the comfort score.

Scc is the success rate of the driving task; Dis is the average distance driven; Acc is the acceleration of the vehicle; TTC is the predicted time to collision with a nearby vehicle; D is the distance to a nearby vehicle; C indicates whether a collision occurred (1 for a collision, 0 for none); V is the speed of the vehicle; T is the degree of throttle opening; R is the steering-wheel angle; B is the braking force; and α1 to α10 and β1 to β3 are all adjustable (tunable) parameters. By continuously adjusting α1 to α10 and β1 to β3, the scorer S(·) can be made to give higher scores to the driving behavior of human experts and lower scores to poor driving behavior.
In this embodiment, when the scorer is first established, its adjustable parameters, such as α1 to α10 and β1 to β3 above, must be randomly initialized, which prevents the subsequent optimization process from falling into a locally optimal solution.
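A sketch of such a manually designed scorer with randomly initialized adjustable parameters follows; the grouping of the ten statistics into the three sub-scores is an assumption made for illustration, since the original formula fixes only which quantities and parameters appear:

import numpy as np

rng = np.random.default_rng(seed=42)
alpha = rng.normal(size=10)  # alpha_1..alpha_10: weights on the ten driving statistics
beta = rng.normal(size=3)    # beta_1..beta_3: weights on the three sub-scores

def composite_score(stats):
    # stats: [Scc, Dis, Acc, TTC, D, C, V, T, R, B] for one training sample.
    s_suc = alpha[0] * stats[0] + alpha[1] * stats[1]               # task completion
    s_safe = sum(alpha[i] * stats[i] for i in range(2, 6))          # safety
    s_comfortable = sum(alpha[i] * stats[i] for i in range(6, 10))  # comfort
    return beta[0] * s_suc + beta[1] * s_safe + beta[2] * s_comfortable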
In one embodiment, the scorer is learned from positive and negative examples. Specifically, α1 to α10 and β1 to β3 are adjusted according to the principle shown in the following formula (5):

[Formula (5), shown in the original as an image: a minimization over the parameters of S(·), combining an expectation over training samples (o, a) collected from the expert sample E with a λ-weighted expectation over training samples (o, a) collected from the non-expert samples O.]

where o and a are, respectively, the sample driving state and the corresponding sample driving action in each training sample; (o, a) ~ E indicates that the training sample (o, a) is collected from the expert sample E; (o, a) ~ O indicates that the training sample (o, a) is collected from the non-expert samples O; S(·) is the composite scorer of formula (4); S(o, a) outputs the degree of similarity between (o, a) and the expert samples; λ is a manually adjusted hyper-parameter; and E[·] denotes the expected value. Updating the parameters of S(·) according to the above formula means adjusting α1 to α10 and β1 to β3; formula (5) adjusts the parameters in S(·) with the objective of minimizing the expression to the right of min.
After adjustment, the scorer outputs a high value when it encounters data from the expert samples and a low value for data from the non-expert samples, thereby distinguishing the training samples. However, the non-expert samples contain plenty of data that resembles the expert samples, and uniformly treating all data in the non-expert samples as poorly performing is clearly unreasonable. A new scorer is therefore needed to mine the better-performing data within the non-expert samples.
In one embodiment, in order to mine the better-performing data in the non-expert samples, i.e., the expert-like data, the distribution function of the driving simulation strategy is used as part of the scorer's input; that is, the scorer contains the distribution function of the driving simulation strategy. Specifically, the expert-like samples among the non-expert samples are determined by the preset scorer as follows: input each training sample into the scorer; adjust the scorer's adjustable parameters according to the conditional probability distribution values that the distribution function generates for each training sample; and determine the expert-like samples among the non-expert samples with the adjusted scorer.
In one specific embodiment, the scorer is learned, i.e., α1 to α10 and β1 to β3 are adjusted, according to the principle of the following formula (6):

[Formula (6), shown in the original as an image: a minimization over the parameters of S(·) in the style of formula (5), with the scorer additionally taking the strategy's conditional probability as input, i.e., S(o, a, log μ(a|o)).]

where μ denotes the driving simulation strategy and μ(a|o) is a conditional probability distribution, namely the conditional probability distribution function of the sample driving action a given the driving state o, i.e., the distribution function of the driving simulation strategy. If the driving simulation strategy μ is very similar to the expert strategy (i.e., an assumed accident-free driving strategy), then μ(a|o) outputs a value approaching 1 when the training sample resembles an expert sample, and a value approaching 0 otherwise. The parameters of S(·) are updated according to this formula; that is, formula (6) adjusts the parameters in S(·) with the objective of minimizing the expression to the right of min, so that the scorer outputs a high value when it sees a sample similar to the expert samples and a low value when it sees data that differs greatly from them, allowing the scorer to readily mine the expert-like data in the non-expert samples.
In one embodiment, after the adjustable parameters in the scorer are adjusted based on the above embodiment, the adjusted scorer is used to determine the expert-like sample in the non-expert sample, and the specific process is as follows: inputting each training sample into the adjusted scorer respectively to obtain a scoring result corresponding to each training sample; and determining the non-expert sample with the scoring result larger than the preset scoring threshold value as an expert-like sample.
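With the adjusted scorer, the screening itself reduces to a threshold test (a sketch; scorer and score_threshold stand for the adjusted scorer and the preset scoring threshold):

def select_expert_like(non_expert_samples, scorer, score_threshold):
    # Keep every non-expert sample whose scoring result exceeds the preset threshold.
    return [s for s in non_expert_samples if scorer(s) > score_threshold]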
In one embodiment, after the adjustable parameters in the scorer have been adjusted as in the above embodiments, the adjusted scorer is incorporated into the driving simulation strategy; that is, the driving simulation strategy includes the adjusted scorer. The training weight of the expert-like samples in the driving simulation strategy is then raised as follows: according to the scoring result corresponding to each training sample, increase the training weight of the expert-like samples in the driving simulation strategy, where the larger the scoring result for a training sample, the larger its training weight.
In one specific embodiment, formula (6) can be viewed as a functional of the driving simulation strategy μ, which, from the viewpoint of functional analysis, can serve as the optimization variable. Taking the first derivative with respect to μ according to formula (6) and combining it with the optimization objective of formula (4) yields the optimization objective shown in formula (7):

[Formula (7), shown in the original as an image: a minimization over the parameters of the driving simulation strategy μ of a sample-weighted imitation objective, in which the weight of each training sample grows with its score S.]

where S abbreviates S(o, a, log μ(a|o)) and δ is a manually adjusted hyper-parameter. Formula (7) adjusts the parameters in the driving simulation strategy μ with the objective of minimizing the expression to the right of min. Because formula (7) is transformed from formula (6), optimizing the objective of formula (7), namely the driving simulation strategy μ, is equivalent to imitating the expert samples and expert-like samples with preference: the sample weight automatically increases the weight of expert-like data within the non-expert samples and automatically decreases the weight of data that differs greatly from the expert samples, which makes it convenient to mine the expert-like data in the non-expert samples.
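An illustrative weighted-imitation update in the spirit of formula (7), assuming an exponential weight exp(S/δ); the exact weighting of the original formula is not asserted here, but any weight that grows with the score S yields the described preference for expert and expert-like samples:

import torch

def weighted_imitation_loss(log_mu, scores, delta=1.0):
    # log_mu: log mu(a|o) for each training sample under the current strategy.
    # scores: S(o, a, log mu(a|o)) from the adjusted scorer, detached so that
    #         only the imitation term is differentiated.
    weights = torch.exp(scores.detach() / delta)
    return -(weights * log_mu).mean()  # higher-scoring samples are imitated more strongly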
Step 103, performing simulation training on the autonomous vehicle with the training samples according to the weight-adjusted driving simulation strategy.
In this embodiment, after the training weights of the expert-like samples in the driving simulation strategy are increased, simulation training is performed on the automatically driven vehicle according to the driving simulation strategy with the adjusted weights through the training samples.
In one embodiment, each training sample includes a corresponding sample driving action and sample driving state. The simulation training of the autonomous vehicle according to the weight-adjusted driving simulation strategy then proceeds as follows for each training sample: input the sample driving action into the weight-adjusted driving simulation strategy; obtain the actual driving state of the autonomous vehicle after it drives automatically according to the driving simulation strategy; and compare the actual driving state with the sample driving state. When the comparison result does not satisfy the preset state error, adjust the internal parameters of the driving simulation strategy until it does.
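This per-sample loop can be sketched as follows (simulate, state_error, and update_parameters are assumed names standing for the simulator, the preset state-error check, and the parameter update described above):

def simulation_training(strategy, training_samples, max_state_error):
    for sample in training_samples:
        while True:
            # Feed the sample driving action into the weight-adjusted strategy and
            # read back the actual driving state after automatic driving.
            actual_state = strategy.simulate(sample.action)
            if state_error(actual_state, sample.state) <= max_state_error:
                break  # the comparison result meets the preset state error
            strategy.update_parameters(actual_state, sample.state)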
In one embodiment, a phased training method may be used when training an autonomous vehicle. Specifically, after the automatic driving vehicle is subjected to simulation training through a training sample according to the driving simulation strategy after the weight is adjusted, the training times of the automatic driving vehicle are obtained; when the training times are determined to be smaller than the times threshold value, acquiring a new training sample set; adjusting the driving simulation strategy again through a new training sample set; and performing simulation training again on the automatic driving vehicle according to the driving simulation strategy after the adjustment again through the new training sample set.
In one specific example, as shown in FIG. 3, the entire training process for an autonomous vehicle is as follows:
step 301, initializing the parameters of the driving simulation strategy and of the scorer;
step 302, extracting a training sample set from the sample database;
step 303, learning and updating the scorer S(·), which contains the distribution function of the driving simulation strategy;
step 304, updating the driving simulation strategy μ using the updated scorer S(·);
step 305, judging whether the training times are smaller than a time threshold, if so, executing step 302, and if not, executing step 306;
step 306, ending the training.
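The whole flow can be condensed into a short loop (a sketch under the assumption that sample_database, scorer, and strategy expose the operations named in steps 301 to 305):

def train_autonomous_driving(sample_database, scorer, strategy, count_threshold):
    scorer.initialize_parameters()                            # step 301
    strategy.initialize_parameters()                          # step 301
    for training_count in range(count_threshold):             # step 305
        batch = sample_database.draw_training_set()           # step 302
        scorer.update(batch, strategy.distribution_function)  # step 303
        strategy.update(batch, scorer)                        # step 304
    # step 306: training ends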
The automatic driving training method provided by the invention acquires a training sample set required for automatic driving training, the set containing at least one expert sample and at least one non-expert sample. Expert-like samples among the non-expert samples are then determined by a preset scorer, and the training weight of the expert-like samples in the driving simulation strategy is raised, the training results of the expert-like samples lying within a preset error of the training results of the expert samples. Finally, the autonomous vehicle is given simulation training with the training samples according to the weight-adjusted driving simulation strategy. In other words, for the expert-like samples in the non-expert samples, i.e., those whose training effect is good, the above process raises their training weight; because the training results of the expert-like samples and the expert samples lie within the preset error, this is equivalent to increasing the number of expert samples, which improves safety during training of the autonomous vehicle and raises training efficiency at the same time. In addition, the process identifies and treats the expert-like samples within the non-expert samples separately, so the non-expert samples are used more fully than when all non-expert samples share the same training strategy, further improving the training effect of the autonomous vehicle.
In the invention, the score output by the scorer is used for automatically adjusting the weight of the expert-like sample in the simulation learning process. If the scorer shows that the driving simulation strategy is similar to the expert strategy, the weight of the driving simulation strategy is increased, and the driving simulation strategy is encouraged to better simulate the expert strategy. Conversely, if the scorer indicates that the driving simulation strategy differs significantly from the expert strategy, its weight is reduced, preventing the driving simulation strategy from being updated in a direction away from the expert strategy.
In addition, the invention adopts a supervised learning framework and introduces only one scorer for judging how similar a sample is to the expert samples; beyond the original imitation learning, only the supervised learning of the scorer is added, no reward function needs to be defined, and the implementation difficulty and computational complexity are low. Meanwhile, the scorer is modified from existing scorers by taking the distribution function of the driving simulation strategy as an additional input signal; compared with existing scorers, it can evaluate how similar a training sample is to the expert samples, so the expert-like data in the non-expert samples can be mined, reducing the dependence of imitation learning on the number of expert samples. Furthermore, the method requires no additional human labeling of training samples and no additional environment interaction, so its cost is low and its safety high, further improving safety during automatic driving and raising training efficiency. If the method is extended to other application scenarios where expert samples are scarce and expensive, such as medical diagnosis and treatment, industrial optimization, robotics, and logistics scheduling, it can strongly support those applications and greatly reduce deployment cost, giving it considerable economic potential.
The following describes the automatic driving training device provided by the present invention; the automatic driving training device described below and the automatic driving training method described above may be referred to correspondingly, and repeated details are not described again. As shown in fig. 4, the automatic driving training device includes:
an obtaining module 401, configured to obtain a training sample set required by automatic driving training, where the training sample set includes at least one expert sample and at least one non-expert sample;
an adjusting module 402, configured to determine, through a preset scorer, a class expert sample in each non-expert sample, and improve a training weight of the class expert sample in the driving simulation strategy, where the scorer is configured to grade each training sample in a training sample set, and a training result of the class expert sample and a training result of the expert sample are within a preset error;
and a training module 403, configured to perform simulation training on the autonomous vehicle according to the driving simulation strategy with the adjusted weights through the training samples.
In one embodiment, the adjusting module 402 is specifically configured to input each training sample into the scorer; adjusting adjustable parameters of the scorer according to the conditional probability distribution values generated by the distribution function and respectively corresponding to each training sample; and determining an expert-like sample in the non-expert sample through the adjusted scorer, wherein the scorer comprises a distribution function of the driving simulation strategy.
In one embodiment, the adjusting module 402 is specifically configured to input each training sample to the adjusted scorer, and obtain a scoring result corresponding to each training sample; and determining the non-expert sample with the scoring result larger than the preset scoring threshold value as an expert-like sample.
The adjusting module 402 is specifically configured to increase the training weight of the expert-like sample in the driving simulation strategy according to the scoring result corresponding to each training sample, where the greater the scoring result corresponding to the training sample is, the greater the training weight corresponding to the training sample is, and the driving simulation strategy includes an adjusted scorer.
In an embodiment, the obtaining module 401 is specifically configured to perform the following processing on each training sample: inputting the sample driving action into the driving simulation strategy after the weight is adjusted; acquiring an actual driving state of an automatically driven vehicle after the vehicle automatically drives according to a driving simulation strategy; and comparing the actual running state with the sample running state, and adjusting the internal parameters of the driving simulation strategy when the comparison result does not accord with the preset state error until the comparison result accords with the preset state error.
In one embodiment, the adjusting module 402 is specifically configured to obtain training times of the autonomous vehicle after performing simulation training on the autonomous vehicle according to the driving simulation strategy with the adjusted weight by using a training sample; when the training times are determined to be smaller than the times threshold value, acquiring a new training sample set; adjusting the driving simulation strategy again through a new training sample set; and performing simulation training again on the automatic driving vehicle according to the driving simulation strategy after the adjustment again through the new training sample set.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)501, a communication Interface (Communications Interface)502, a memory (memory)503, and a communication bus 504, wherein the processor 501, the communication Interface 502, and the memory 503 are configured to communicate with each other via the communication bus 504. Processor 501 may call logic instructions in memory 503 to perform an automated driving training method comprising: acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining a class expert sample in each non-expert sample through a preset scorer, and improving the training weight of the class expert sample in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and the training result of the class expert sample and the training result of the expert sample are within a preset error; and performing simulation training on the automatic driving vehicle through the training sample according to the driving simulation strategy after the weight is adjusted.
In addition, the logic instructions in the memory 503 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the automatic driving training method provided by the above methods, the method comprising: acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining a class expert sample in each non-expert sample through a preset scorer, and improving the training weight of the class expert sample in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and the training result of the class expert sample and the training result of the expert sample are within a preset error; and performing simulation training on the automatic driving vehicle through the training sample according to the driving simulation strategy after the weight is adjusted.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an automated driving training method provided by the above methods, the method comprising: acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample; determining a class expert sample in each non-expert sample through a preset scorer, and improving the training weight of the class expert sample in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and the training result of the class expert sample and the training result of the expert sample are within a preset error; and performing simulation training on the automatic driving vehicle through the training sample according to the driving simulation strategy after the weight is adjusted.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An automated driving training method, comprising:
acquiring a training sample set required by automatic driving training, wherein the training sample set comprises at least one expert sample and at least one non-expert sample;
determining expert-like samples in the non-expert samples through a preset scorer, and improving training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and training results of the expert-like samples and training results of the expert samples are within a preset error;
and performing simulation training on the automatic driving vehicle according to the driving simulation strategy after the weight is adjusted through the training sample.
2. The automated driving training method of claim 1, wherein the scorer comprises a distribution function of the driving simulation strategy;
the determining, by a preset scorer, the expert-like samples in the non-expert samples includes:
inputting each of the training samples into the scorer;
adjusting the adjustable parameters of the scorer according to the conditional probability distribution values generated by the distribution function and respectively corresponding to each training sample;
determining the expert-like samples in the non-expert samples through the adjusted scorer.
3. The automated driver training method of claim 2, wherein the determining the expert-like sample of the non-expert samples by the adjusted scorer comprises:
inputting each training sample into the adjusted scorer respectively to obtain a scoring result corresponding to each training sample;
and determining the non-expert sample with the scoring result larger than a preset scoring threshold value as the class expert sample.
4. The automated driving training method of claim 3, wherein the driving simulation strategy includes the adjusted scorer;
the method for improving the training weight of the expert-like sample in the driving simulation strategy comprises the following steps:
and according to the scoring result corresponding to each training sample, improving the training weight of the expert-like sample in the driving simulation strategy, wherein the larger the scoring result corresponding to the training sample is, the larger the training weight corresponding to the training sample is.
5. The automated driving training method according to claim 1, wherein after performing simulation training on the automated driving vehicle according to the driving simulation strategy with the adjusted weight by using the training samples, the method further comprises:
acquiring training times of the automatic driving vehicle;
when the training times are determined to be smaller than a time threshold value, acquiring a new training sample set;
adjusting the driving simulation strategy again through the new training sample set;
and performing simulation training on the automatic driving vehicle again according to the driving simulation strategy after the adjustment again through a new training sample set.
6. The automated driving training method of claim 1, wherein each of the training samples comprises a corresponding sample driving action and a sample driving state;
the simulation training of the automatic driving vehicle is carried out according to the driving simulation strategy after the weight is adjusted through the training sample, and the simulation training comprises the following steps:
performing the following processing on each training sample:
inputting the sample driving action to the weighted driving simulation strategy;
acquiring an actual driving state of the automatic driving vehicle after automatic driving according to the driving simulation strategy;
and comparing the actual running state with the sample running state, and when the comparison result does not accord with a preset state error, adjusting the internal parameters of the driving simulation strategy until the comparison result accords with the preset state error.
7. An automated driving training apparatus, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a training sample set required by automatic driving training, and the training sample set comprises at least one expert sample and at least one non-expert sample;
the adjusting module is used for determining expert-like samples in the non-expert samples through a preset scorer and improving training weights of the expert-like samples in a driving simulation strategy, wherein the scorer is used for grading each training sample in a training sample set, and training results of the expert-like samples and training results of the expert samples are within a preset error;
and the training module is used for performing simulation training on the automatic driving vehicle according to the driving simulation strategy after the weight is adjusted through the training sample.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the automated driving training method according to any of claims 1 to 6 are implemented when the program is executed by the processor.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the automated driving training method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the automatic drive training method according to any one of claims 1 to 6 when executed by a processor.
CN202111437745.0A 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product Pending CN114372501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111437745.0A CN114372501A (en) 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111437745.0A CN114372501A (en) 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN114372501A true CN114372501A (en) 2022-04-19

Family

ID=81139520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111437745.0A Pending CN114372501A (en) 2021-11-29 2021-11-29 Automatic driving training method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN114372501A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114986512A (en) * 2022-06-27 2022-09-02 清华大学 Dual preference simulation learning method and system supported by dynamic model
CN115099037A (en) * 2022-06-27 2022-09-23 清华大学 Preference simulation learning method and preference simulation learning system supported by dynamic model

Similar Documents

Publication Publication Date Title
US11062617B2 (en) Training system for autonomous driving control policy
CN110949398B (en) Method for detecting abnormal driving behavior of first-vehicle drivers in vehicle formation driving
CN108227710B (en) Automatic driving control method and apparatus, electronic device, program, and medium
Zhang et al. Query-efficient imitation learning for end-to-end autonomous driving
CN114358128B (en) Method for training end-to-end automatic driving strategy
CN111231983B (en) Vehicle control method, device and equipment based on traffic accident memory network
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
US12005922B2 (en) Toward simulation of driver behavior in driving automation
EP4022514B1 (en) Multi-agent simulations
CN112068549B (en) Unmanned system cluster control method based on deep reinforcement learning
CN111483468A (en) Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning
CN112508164B (en) End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN110119714B (en) Driver fatigue detection method and device based on convolutional neural network
CN115053237A (en) Vehicle intent prediction neural network
CN111260027A (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN114372501A (en) Automatic driving training method, device, equipment, storage medium and program product
EP4216098A1 (en) Methods and apparatuses for constructing vehicle dynamics model and for predicting vehicle state information
Babiker et al. Convolutional neural network for a self-driving car in a virtual environment
CN117610681A (en) Automatic driving automobile decision-making method based on imitation learning and discrete reinforcement learning
CN117719535A (en) Human feedback automatic driving vehicle interactive self-adaptive decision control method
DE102019203634A1 (en) Method and device for controlling a robot
CN114148349B (en) Vehicle personalized following control method based on generation of countermeasure imitation study
CN116300944A (en) Automatic driving decision method and system based on improved Double DQN
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
CN112560354B (en) Car following behavior modeling method based on Gaussian process regression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination