CN109740738B - Neural network model training method, device, equipment and medium - Google Patents
- Publication number
- CN109740738B (application number CN201811645093.8A)
- Authority
- CN
- China
- Prior art keywords
- training
- neural network
- network model
- sample set
- learning object
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The embodiment of the application discloses a neural network model training method and device. For a learning object that requires reinforcement learning, an artificial sample set generated by the learning object according to user operations can be acquired, together with a machine sample set obtained by the autonomous learning, in the learning object, of the neural network model for that learning object. When the neural network model is trained, both the artificial sample set and the machine sample set serve as the training basis. Because the training sample set includes artificially produced samples, which are of higher quality than the machine samples obtained in the initial stage of machine learning, are more purposeful in advancing the completion progress of the learning object, and mostly represent meaningful interaction with the learning object, the convergence time of the model parameters in the early stage of training can be shortened, reducing the time needed to train the neural network model.
Description
Technical Field
The present application relates to the field of neural networks, and in particular, to a neural network model training method, apparatus, device, and computer-readable storage medium.
Background
Reinforcement learning, also called trial-and-error learning, is a machine learning algorithm that lets an agent continuously interact with the environment of a learning object and learn from the reward feedback the environment provides; the algorithm relies on no prior knowledge and can learn completely autonomously. The agent differs with the learning object; for example, when the learning object is a game, the agent may be a character or a participant in the game.
When traditional reinforcement learning such as the Deep Q-Network (DQN) trains its neural network model, only data obtained entirely through the machine's autonomous learning is used as training data.
Such training data is obtained by the machine's own trial and error. In the early stage of training in particular, the machine's trial-and-error process is slow and most of its interactions are meaningless, so the model parameters take a long time to converge, the cost is high, and the time for training the neural network model is prolonged.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a neural network model training method and apparatus, so as to shorten the convergence time of the model parameters in the early stage of training and reduce the time needed to train the neural network model.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a neural network model training method, where the method includes:
acquiring an artificial sample set generated by a learning object according to user operation;
acquiring a machine sample set obtained by autonomous learning, in the learning object, of a neural network model for the learning object;
training the neural network model according to the artificial sample set and the machine sample set.
In a second aspect, an embodiment of the present application provides a neural network model training apparatus, where the apparatus includes a first obtaining unit, a second obtaining unit, and a training unit:
the first acquisition unit is used for acquiring an artificial sample set generated by a learning object according to user operation;
the second acquisition unit is used for acquiring a machine sample set obtained by autonomous learning of a neural network model aiming at the learning object in the learning object;
the training unit is used for training the neural network model according to the artificial sample set and the machine sample set.
In a third aspect, an embodiment of the present application provides an apparatus for neural network model training, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the neural network model training method of the first aspect according to instructions in the program code.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program code for executing the neural network model training method described in the first aspect.
According to the above technical scheme, for a learning object that requires reinforcement learning, an artificial sample set generated by the learning object according to user operations can be acquired, together with a machine sample set obtained by the autonomous learning, in the learning object, of the neural network model for that learning object. When the neural network model is trained, both the artificial sample set and the machine sample set serve as the training basis. Because the training sample set includes artificially produced samples, which are of higher quality than the machine samples obtained in the initial stage of machine learning, are more purposeful in advancing the completion progress of the learning object, and mostly represent meaningful interaction with the learning object, the convergence time of the model parameters in the early stage of training can be shortened, reducing the time needed to train the neural network model.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic view of an application scenario of a neural network model training method provided in an embodiment of the present application;
fig. 2 is a flowchart of a neural network model training method provided in an embodiment of the present application;
fig. 3 is an exemplary diagram of a QQ racing game as a learning object provided in an embodiment of the present application;
fig. 4 is a schematic flow structure diagram of a neural network model training method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a flow of neural network model pre-training provided in the embodiment of the present application;
fig. 6 is a flowchart of a neural network model training method according to an embodiment of the present disclosure;
fig. 7a is a block diagram of a neural network model training apparatus according to an embodiment of the present disclosure;
fig. 7b is a structural diagram of a neural network model training device according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of an apparatus for neural network model training provided in an embodiment of the present application;
fig. 9 is a block diagram of an apparatus for neural network model training according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
When traditional reinforcement learning trains its own neural network model, the training data is obtained entirely by the machine's trial and error. In the early stage of training in particular, the machine's trial-and-error process is slow, meaningless interactions are frequent, and sample quality is relatively poor, so the model parameters take a long time to converge, the cost is high, and the time for training the neural network model is prolonged.
In order to solve the above technical problem, an embodiment of the present application provides a neural network model training method. Artificially generated samples are of higher quality than the machine samples obtained in the initial stage of machine learning: they are more purposeful in advancing the completion progress of the learning object, and most of them represent meaningful interaction with the learning object. The training method provided by the embodiment of the application therefore introduces artificially generated samples into the training sample set used to train the model, so that they assist the machine samples in training the neural network model.
The neural network model training method provided by the embodiment of the application can be applied to data processing equipment, such as terminal equipment, a server and the like. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like; the server may be specifically an independent server or a cluster server.
In order to facilitate understanding of the technical solution of the present application, the neural network model training method provided in the embodiments of the present application is described below with reference to an actual application scenario and taking a data processing device as a server as an example.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a neural network model training method provided in an embodiment of the present application, where the application scenario may include at least one terminal device 101 and a server 102, and a learning object may be configured in the terminal device 101, and an artificial sample set and a machine sample set may be generated by operating the terminal device 101. The server 102 may obtain, from the terminal device 101, an artificial sample set generated by a learning object according to a user operation and a machine sample set obtained by autonomously learning in the learning object for a neural network model of the learning object, so as to train the neural network model.
The neural network model can be trained in advance and can also continue to be trained during use. The learning object can be any object to which deep learning is applied; it has a given task and needs to complete that task, or advance its progress, by performing skill actions under certain rules. The learning object may be, for example, a game or a sports event. If a game is the learning object, the neural network model of the learning object performs reinforcement learning on game-playing skills; if a sports event such as a jumping event is the learning object, the neural network model performs reinforcement learning on jumping skills under different environments.
Samples are used to train the neural network model. In this embodiment, a sample may be an artificial sample, a machine sample, or a training sample. The types of data included in artificial samples, machine samples, and training samples are consistent.
Each artificial sample is generated by the learning object according to a user operation, and a plurality of artificial samples form the artificial sample set. Each machine sample is obtained by the autonomous learning, in the learning object, of the neural network model for the learning object, and a plurality of machine samples form the machine sample set. The training samples are samples selected from the artificial sample set and the machine sample set when the neural network model is trained, and a plurality of training samples form the training sample set.
The artificial sample set is data generated artificially for the learning object based on human prior knowledge; it may be prepared in advance or generated continuously by a user performing operations on the learning object on the terminal device 101. In the initial stage of autonomous learning the neural network model has not been trained, its model parameters may still be initial values, and most of the machine samples it obtains at that point represent meaningless interaction with the learning object. The quality of an artificial sample is therefore higher than that of a machine sample obtained in the initial stage of machine learning, and the artificial sample is more purposeful in advancing the completion progress of the learning object.
Therefore, when the server 102 trains the neural network model for the learning object according to the acquired artificial sample set and machine sample set, the addition of the higher-quality artificial sample set makes the training more purposeful in advancing the completion progress of the learning object, so the convergence time of the model parameters in the early stage of training can be shortened and the time for training the neural network model is reduced.
The neural network model training method provided by the embodiment of the application can be applied to application scenes of Artificial Intelligence (AI) training, such as games, robots, industrial automation, education, medicine, finance and the like. For convenience of introduction, the following embodiments will be described with an application scenario (game as a learning object) of an AI of a training game.
Next, a neural network model training method provided by the embodiments of the present application will be described with reference to the drawings.
Referring to fig. 2, fig. 2 shows a flow chart of a neural network model training method, the method comprising:
s201, acquiring an artificial sample set generated by the learning object according to user operation.
In the application scenario of training a game AI, the game is the learning object. A user plays the game according to its rules, thereby generating an artificial sample set from the user's operations. Although the rules differ from game to game, the types of data included in the artificial sample set are generally consistent across games.
The generated artificial sample set may be saved, for example, in an artificial sample pool, so that it can be obtained from the pool whenever it is needed to assist in training the neural network model.

S202, acquiring a machine sample set obtained by autonomous learning, in the learning object, of the neural network model for the learning object.
The machine sample set generated in the learning object by the neural network model may be saved, for example, in a replay memory, so that when the machine sample set is needed to train the neural network model, it can be obtained from the replay memory. The types of data included in the machine sample set are consistent with those of the artificial sample set.
In one possible implementation, the machine sample set and the artificial sample set both include the actions performed in the learning object, the environmental parameters of the learning object when performing the actions, and the feedback parameters of the learning object after performing the actions.
The actions performed in the learning object are actions the learning object allows, and they may be performed in different ways, for example through user operations or through the neural network model.
The environment parameter of the learning object when the action is performed may be used to identify an environment provided when the action is performed, for example, when a game is used as the learning object, the environment parameter may be a game screen when the action is performed.
The feedback parameter of the learning object after an action is performed is the feedback the learning object generates when the performed action interacts with it; the feedback parameter serves to evaluate how an action performed in the learning object affects the progress of the task.
For example, when the learning object is a running game, its given task is to move to the end point, under certain rules, without being blocked by obstacles. Suppose the user observes that the current environment of the learning object includes a low wall. If the action the user performs based on this environment is a jump, the game character leaps over the low wall and keeps advancing at high speed; the action prevents the character from being blocked by the obstacle, which helps advance the task, so the learning object generates a higher feedback parameter for this action. If the action the user performs is instead to run straight at the low wall, the character collides with it and stalls, which hinders forward movement and does not help advance the task, so the learning object generates a lower feedback parameter for this action.
The artificial sample set comprises the corresponding relation among the action implemented by the user in the learning object, the environment parameter of the learning object when the action is implemented and the feedback parameter of the learning object after the action is implemented. The corresponding relation included in the manual sample set can reflect the action selected and implemented by the user in any environment of the learning object and the influence of the action implemented by the user on the progress of the completion of the propulsion task.
The machine sample set comprises corresponding relations among actions implemented in the learning object through the neural network model, environment parameters of the learning object when the actions are implemented, and feedback parameters of the learning object after the actions are implemented. The corresponding relation included in the machine sample set can reflect the action selected and implemented in any environment of the learning object through the neural network model and the influence of the action implemented through the neural network model on the progress of the completion of the propulsion task.
The neural network model is used to autonomously learn the skills in the learning object. Specifically, the model can determine candidate actions to perform in the learning object according to the environmental parameters; each candidate action has a corresponding feedback parameter, and the action actually performed under the given environmental parameters can be selected based on these feedback parameters, so as to advance the completion progress of the learning object.
It should be noted that, because the neural network model has not been trained in the initial stage of autonomous learning, its model parameters may still be initial values, and the model does not yet know which actions help advance the completion progress of the learning object. The actions it selects in this stage are therefore very likely meaningless: a selected action may be irrelevant to the completion progress, or may even make the progress impossible to advance, so the initial stage of autonomous learning involves many meaningless interactions with the learning object. For example, if the learning object is a running game and the action that would advance its completion progress is walking forward, the untrained neural network model may instead select walking left; since walking left does nothing to advance the completion progress, a machine sample containing the action "walk left" performed through the neural network model represents a meaningless interaction.
Therefore, the neural network model is trained by combining the artificial sample set and the machine sample set in the subsequent process of training the neural network model.
The samples in the artificial sample set or the machine sample set may be represented in a certain data format, for example, the following data format may be used: (environmental parameters of the learning object at the time of performing the action, actions to be performed on the learning object, and feedback parameters of the learning object after performing the action).
The feedback parameters of the learning object after an action is performed are used to evaluate how the action affects the progress of the task. The feedback parameters may include at least one of a reward parameter obtained by performing the target action and an environmental parameter obtained in the learning object by performing the target action. The target action is any action included in any target sample of the artificial sample set or the machine sample set.
Both of these aspects can serve to evaluate how actions performed in the learning object affect the progress of the task.
In one implementation, the feedback parameters include the reward parameter (reward) obtained by performing the target action.
In many cases, different target actions yield different rewards. The better the reward obtained by performing a target action, the more the performed action helps advance the completion progress of the learning object; the quality of the reward is represented by the reward parameter.
In one implementation, the feedback parameters include the environmental parameter obtained in the learning object by performing the target action.
In many cases, different target actions lead to different environments after they are performed, and the resulting environment can be represented by the environmental parameter obtained in the learning object by performing the target action. The quality of the environmental parameter reflects the action's influence on the progress of the learning object.
Taking the QQ racing game shown in fig. 3 as an example, suppose the performed action is a drift, and after the action the racing car hits the guardrail beside the track and the game terminates. The environment in which the car hits the guardrail is the environment after the drift action was performed, and the environmental parameter corresponding to that environment shows that the drift action did not help advance the completion progress of the learning object.
Based on this, in another implementation of this embodiment, evaluating a target action by the reward parameter alone or by the environmental parameter alone is not accurate enough: over-pursuing one of them means ignoring the other. Therefore, for a performed action such as the target action, its influence on the progress of the learning object can be evaluated by considering together both the reward parameter obtained after the target action is performed and the environmental parameter obtained in the learning object by performing it. That is, for any target action in the artificial sample set or the machine sample set, the feedback parameters of the learning object after the target action include both the reward parameter obtained by performing the target action and the environmental parameter obtained in the learning object by performing it. Using both aspects together as feedback parameters evaluates the influence of the target action on the progress of the learning object more accurately.
Based on feedback parameters that include the reward parameter obtained by performing the target action and the environmental parameter obtained in the learning object by performing the target action, the samples in the artificial sample set or the machine sample set may use the following data format: (s, a, r, s'). Here s denotes the environmental parameter of the learning object when the target action is performed, a denotes the target action performed in the learning object, r denotes the reward parameter obtained by performing the target action, and s' denotes the environmental parameter obtained in the learning object after performing the target action.
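For illustration only, the sample format above could be represented as follows in Python; the name `Transition` and the example values are assumptions for this sketch, not part of this application:

```python
from collections import namedtuple

# One sample in the (s, a, r, s') format described above:
#   s      - environmental parameter when the target action is performed
#   a      - target action performed in the learning object
#   r      - reward parameter obtained by performing the target action
#   s_next - environmental parameter obtained after performing the action
Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

# Artificial samples (from user play) and machine samples (from autonomous
# learning) share this format, so they can be mixed freely during training.
artificial_sample = Transition(s=[0.2, 0.7], a=1, r=1.0, s_next=[0.25, 0.7])
machine_sample = Transition(s=[0.2, 0.7], a=3, r=-0.5, s_next=[0.2, 0.7])
```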
S203, training the neural network model according to the artificial sample set and the machine sample set.
When training the neural network model, samples can be selected from the artificial sample set and the machine sample set to obtain a training sample set, and the neural network model is trained with this training sample set over N rounds. The training sample set adopted in any round, or in the first M rounds, may include some artificial samples from the artificial sample set and some machine samples from the machine sample set.
In a possible implementation, if the samples in the artificial sample set or the machine sample set adopt the data format (s, a, r, s'), then during training the neural network model may be trained with a preset algorithm, for example a preset gradient-optimization algorithm, according to the loss function, until the trained neural network model achieves a good effect. The loss function is shown as equation (1):
y = r + γ·max_a Q(s', a)

loss = (y − Q(s, a))²    (1)
wherein s represents the environmental parameter of the learning object when the target action is performed; a represents the target action performed in the learning object; r represents the reward parameter obtained by performing the target action; s' represents the environmental parameter obtained in the learning object after performing the target action; Q(s, a) represents the actual value, output by the neural network model, of performing action a in the learning object under environmental parameter s; Q(s', a) represents the actual value, output by the neural network model, of performing action a in the learning object under environmental parameter s'; γ is the discount coefficient of the value Q(s', a), typically set to 0.99; y represents the theoretical value of performing action a in the learning object under environmental parameter s; and loss represents the loss of the actual value with respect to the theoretical value.
The model parameters of the neural network model are adjusted according to the loss until the adjusted model parameters make the neural network model perform well.
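As a non-authoritative sketch, equation (1) could be computed as follows in PyTorch, assuming a discrete action space and a value network `q_net` that maps a batch of environmental parameters to one value per action; these assumptions are illustrative and not specified by this application:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Equation (1): loss = (y - Q(s, a))^2, y = r + gamma * max_a Q(s', a)."""
    # Actual value Q(s, a): the network's value for the action actually
    # taken; a is a LongTensor of action indices.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Theoretical value y; detached so the target is treated as a constant.
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```

The returned loss is then backpropagated and a gradient-optimization step adjusts the model parameters, as described above.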
If the samples in the artificial sample set or the machine sample set adopt the data format (s, a, r, s'), the input of the neural network model is the environmental parameter s of the learning object when the target action is performed, and the output is the action a to be performed in the learning object. The action a is determined according to r and s'; generally, the action a corresponding to the better r and s' is selected as the output. The trained neural network model can then be used to let a machine play a game, so that game faults can be detected through machine play.
The trained neural network model can provide, for any environmental parameter, a corresponding action that advances the completion progress of the learning object. Hence, when the model is in use and some environmental parameter is input to it, the action the model determines is likewise an action that advances the completion progress of the learning object.
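Continuing the sketch above, action selection at use time could look like this, again assuming the illustrative discrete-action `q_net`:

```python
import torch

def select_action(q_net, s):
    # Pick the action with the highest predicted value Q(s, a), i.e. the
    # action deemed most helpful for advancing the learning object's progress.
    with torch.no_grad():
        q_values = q_net(s.unsqueeze(0))  # add a batch dimension
        return int(q_values.argmax(dim=1).item())
```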
According to the above technical scheme, for a learning object that requires reinforcement learning, an artificial sample set generated by the learning object according to user operations can be acquired, together with a machine sample set obtained by the autonomous learning, in the learning object, of the neural network model for that learning object. When the neural network model is trained, both the artificial sample set and the machine sample set serve as the training basis. Because the training sample set includes artificially produced samples, which are of higher quality than the machine samples obtained in the initial stage of machine learning, are more purposeful in advancing the completion progress of the learning object, and mostly represent meaningful interaction with the learning object, the convergence time of the model parameters in the early stage of training can be shortened, reducing the time needed to train the neural network model.
It can be understood that the obtained machine sample set is stored mainly in the replay memory, and the quality of the machine sample set in the replay memory greatly influences the quality of the machine samples used for training the neural network model. The machine sample set in the replay memory is described next.
In this embodiment, the trained neural network model may itself be used to obtain data through autonomous learning. As training proceeds the trained model improves, so the quality of the data it obtains through autonomous learning also rises. To improve the quality of the machine sample set used for training, data obtained by the autonomous learning of the trained neural network model can therefore, during the training process, be used as machine samples and supplemented into the replay memory, i.e. added to the machine sample set, as shown in fig. 4, where (s, a, r, s') is data obtained by the autonomous learning of the trained neural network model. When the machine sample set in the replay memory reaches the memory's capacity, the machine sample stored first is deleted first.
As training continues, the trained neural network model improves, the data it obtains through autonomous learning rises in quality, and adding this higher-quality data to the machine sample set improves the quality of the machine sample set in the replay memory. This makes the set more conducive to advancing the completion progress of the learning object, shortens the convergence time of the model parameters, and reduces the time for training the neural network model.
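A minimal sketch of such a replay memory with first-in-first-out eviction follows; the capacity value is an assumption for illustration:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores machine samples; once capacity is reached, the sample that
    was stored first is deleted first, as described above."""
    def __init__(self, capacity=100_000):  # capacity is illustrative
        self.buffer = deque(maxlen=capacity)  # deque evicts the oldest item

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, k):
        return random.sample(self.buffer, k)
```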
It should be noted that the embodiment corresponding to fig. 2 introduces the training method of the neural network model, and the training sample set used for training may be composed of samples selected from the artificial sample set obtained in S201 and the machine sample set obtained in S202. To ensure the quality of the training sample set, it needs to include artificial samples; but given the autonomous-learning scenario, the neural network model can hardly be trained entirely on artificial samples throughout the whole process, and the artificial samples in the training sample set cannot be too many either. The training sample set therefore needs to include both artificial samples and machine samples.
Therefore, one possible implementation of S203 is: select samples from the artificial sample set and the machine sample set respectively to obtain a training sample set for training the neural network model, where the number of artificial samples and the number of machine samples in the training sample set meet a preset ratio; then train the neural network model according to the training sample set. The samples included in the training sample set may be referred to as training samples.
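A sketch of assembling such a training sample set at a preset ratio; the batch size and the 1:3 artificial-to-machine ratio below are illustrative assumptions:

```python
import random

def make_training_batch(artificial_pool, machine_pool,
                        batch_size=32, artificial_fraction=0.25):
    # The number of artificial samples is fixed by the preset ratio;
    # the rest of the batch is filled with machine samples.
    n_artificial = int(batch_size * artificial_fraction)
    batch = (random.sample(artificial_pool, n_artificial)
             + random.sample(machine_pool, batch_size - n_artificial))
    random.shuffle(batch)  # mix the two sources inside the batch
    return batch
```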
Next, how to determine the preset ratio between the number of artificial samples and the number of machine samples in the training sample set will be described.
In this embodiment, the preset ratio can be determined in multiple ways; two of them are mainly taken as examples here.
In the first way, the preset ratio of the two kinds of samples in the training sample set used for the current round of training is determined according to the training result of the previous round; that is, the preset ratio between the number of artificial samples and the number of machine samples in the training sample set used for the N-th training is determined according to the training result of the (N-1)-th training of the neural network model.
The training result of the (N-1)-th training may be the neural network model obtained from that training. When this model is used, actions are output according to it, and the preset ratio is adjusted according to how the actions output by the model affect the progress of the learning object (the game).
For example, when a game is played with the neural network model obtained from the (N-1)-th training and the model outputs actions that do not help advance the completion progress of the learning object (the game), the completion progress stays low; accordingly, the preset ratio between the number of artificial samples and the number of machine samples in the training sample set can be adjusted. Because artificial samples help advance the completion progress of the learning object, the number of artificial samples in the training sample set can be increased, i.e. the preset ratio is adjusted to raise the proportion of artificial samples, so that the trained neural network model outputs actions more likely to advance the completion progress.
Conversely, the number of artificial samples in the training sample set for training the neural network model can be reduced, i.e. the preset ratio is adjusted to lower the proportion of artificial samples in the training sample set.
In the second way, the proportion of artificial samples in the training sample set is gradually reduced; that is, the proportion of artificial samples in the training sample set used for the N-th training of the neural network model is smaller than the proportion in the training sample set used for the (N-1)-th training.
It is understood that, when the learning object is a game, playing the game through the neural network model serves not only to complete the game's progress but also to detect game faults. Detecting a fault requires performing various actions under one environmental parameter through the neural network model, obtaining the game's environmental parameter (game screen) after each action, and checking that environmental parameter, for example checking whether some part of the game screen fails to render after an action.
However, because the artificial sample set is generated from human prior knowledge, for any given environmental parameter a user basically selects a commonly used action and rarely tries others, so the artificial sample set is relatively uniform and lacks diversity. For example, in a running game, when the environmental parameter indicates a crossbar ahead on the course, most people, based on prior knowledge, jump over the crossbar, so users generally choose the jump action. An alternative action would be to slide under the crossbar, but users rarely or never choose it, whether because of prior knowledge or because the sliding action is complicated to perform.
To ensure that different actions can be performed under one environmental parameter when a game is played with the neural network model, so that game faults can be detected, the training samples in the training sample set need good diversity. Therefore, as the number of training rounds grows, the proportion of artificial samples in the training sample set can be gradually reduced and the proportion of machine samples raised, giving the training samples better diversity.
Gradually reducing the proportion of artificial samples in the training sample set thus improves the diversity of the training samples, and when a game is played with the trained neural network model, different actions can be performed under one environmental parameter, enabling game-fault detection.
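One way to realize this second scheme is a simple annealing schedule over training rounds; the linear form and the constants below are assumptions, since the application only requires the proportion to decrease from round to round:

```python
def artificial_fraction(round_n, start=0.5, end=0.05, decay_rounds=1000):
    # Linearly anneal the proportion of artificial samples so that the
    # fraction used for round N is no larger than for round N-1.
    t = min(round_n / decay_rounds, 1.0)
    return start + (end - start) * t
```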
Next, how the artificial sample set and the machine sample set are used to train the neural network model for the learning object is described. According to the characteristics of the two sample sets, they can be put to sensible use in different periods of neural network model training.
In one implementation, the neural network model may be pre-trained from the artificial sample set before S203 is performed. In the early stage of training, the model parameters converge more slowly than in the later stage, they change a great deal, and the resulting model is very unstable. To obtain a stable model quickly in this early stage, the neural network model can be pre-trained using only the higher-quality artificial sample set, yielding a pre-trained neural network model that is relatively stable, with relatively small parameter changes. S203 is then performed, i.e. the pre-trained neural network model is trained according to the artificial sample set and the machine sample set.
The flow of neural network model pre-training is shown schematically in fig. 5: the artificial samples used for pre-training are selected from an artificial sample pool storing the artificial sample set, and the neural network model is pre-trained according to the selected samples to obtain the pre-trained neural network model.
Because the artificial sample set is purposeful in advancing the completion progress of the learning object, it shortens the time the model parameters need to stabilize, so a stable pre-trained neural network model is obtained quickly. Consequently, when the pre-trained model is then trained according to the artificial sample set and the machine sample set, the time for training the neural network model is reduced.
The pre-training of the neural network model cannot go on indefinitely; at some point it finishes, and the pre-trained model is then trained further according to the artificial sample set and the machine sample set. How to judge that the pre-training of the neural network model is complete is described below.
This embodiment mainly introduces two judgment methods. The first is when the artificial sample set has been fully used for training: pre-training uses the artificial sample set, and once all of its samples have been used, no samples remain with which to continue pre-training, at which point the pre-training can be considered complete.
The second judgment method is that a preset progress of the learning object is completed through the neural network model. It can be understood that the purpose of pre-training is to obtain a relatively stable model; once a stable model is obtained, pre-training can be considered complete. Whether the model is stable is reflected by its parameters: the smaller the change in the model parameters, the more stable the model, and the larger the change, the less stable it is. Stability can in turn be measured by the progress the model completes on the learning object: the greater the progress of the learning object completed through the neural network model, the smaller the parameter changes and the more stable the model; conversely, the smaller the progress, the larger the parameter changes and the less stable the model.
For example, suppose the learning object is a running game whose track is 1000 meters long. If the game terminates after the neural network model completes 200 meters of the track, those 200 meters are the progress of the learning object completed through the model; the progress is large, so the model can be considered stable. If the game terminates after the model completes only 5 meters, the progress is small and the model can be considered unstable.
Therefore, how much progress of the learning object counts as stable can be set in advance: a preset progress is defined such that the neural network model is considered stable once it completes that progress. Thus, when the preset progress of the learning object is completed through the neural network model, the model is determined to have completed pre-training.
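A sketch of the pre-training stage with both completion criteria follows; `collate` (stacking a batch of samples into tensors) and `completed_progress` (measuring how far the model gets in the learning object) are hypothetical helpers, and `dqn_loss` is the earlier sketch:

```python
def pretrain(q_net, artificial_pool, optimizer,
             preset_progress, completed_progress, batch_size=32):
    # Stop when the artificial sample set is exhausted (first criterion)
    # or the model completes the preset progress (second criterion).
    remaining = list(artificial_pool)
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        loss = dqn_loss(q_net, *collate(batch))  # collate: hypothetical helper
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if completed_progress(q_net) >= preset_progress:
            break
```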
Next, the neural network model training method will be described with reference to specific application scenarios. In the application scenario, the neural network model obtained through training is used for playing games to detect game faults, and the games are learning objects when the neural network model is trained. In this application scenario, referring to fig. 6, the neural network model training method includes:
s601, playing a game on the terminal equipment by the user to obtain an artificial sample set.
S602, acquiring the artificial sample set from the artificial sample pool.
The artificial sample pool is constructed from the collected artificial samples generated by the user while playing the game.
S603, pre-training the neural network model according to the acquired artificial sample set.
S604, judging whether the artificial sample set has been fully used for training, or whether the preset progress of the game has been completed through the neural network model; if so, executing S605, and if not, executing S603.
S605, performing autonomous learning in the game through the pre-trained neural network model to obtain a machine sample set.
S606, acquiring an artificial sample set in the artificial sample pool and a machine sample set in the replay memory.
The replay memory is constructed from a set of collected machine samples.
S607, training the pre-trained neural network model according to the artificial sample set and the machine sample set.
The neural network model training process mainly comprises two stages: a pre-training stage and a stage of training the pre-trained neural network model. S601-S604 form the pre-training stage, and S605-S607 form the stage of training the pre-trained neural network model.
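Tying the sketches above to S601-S607, the overall two-stage flow might look as follows; `evaluate_progress` and `play_one_step` are hypothetical helpers, and the preset progress of 200 is only an example value:

```python
def train_pipeline(q_net, artificial_pool, game_env, optimizer, rounds=10_000):
    # Stage 1 (S601-S604): pre-train on artificial samples only.
    pretrain(q_net, artificial_pool, optimizer,
             preset_progress=200, completed_progress=evaluate_progress)
    # Stage 2 (S605-S607): the pre-trained model plays autonomously,
    # filling the replay memory, and training mixes both sample sets.
    memory = ReplayMemory()
    for n in range(rounds):
        memory.add(play_one_step(q_net, game_env))  # new machine sample
        frac = artificial_fraction(n)               # decaying preset ratio
        batch = make_training_batch(artificial_pool, memory.buffer,
                                    artificial_fraction=frac)
        loss = dqn_loss(q_net, *collate(batch))
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```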
According to the above technical scheme, for a learning object that requires reinforcement learning, an artificial sample set generated by the learning object according to user operations can be acquired, together with a machine sample set obtained by the autonomous learning, in the learning object, of the neural network model for that learning object. When the neural network model is trained, both the artificial sample set and the machine sample set serve as the training basis. Because the training sample set includes artificially produced samples, which are of higher quality than the machine samples obtained in the initial stage of machine learning, are more purposeful in advancing the completion progress of the learning object, and mostly represent meaningful interaction with the learning object, the convergence time of the model parameters in the early stage of training can be shortened, reducing the time needed to train the neural network model.
Based on the neural network model training method provided in the foregoing embodiment, this embodiment provides a neural network model training apparatus 700, referring to fig. 7a, the apparatus 700 includes a first obtaining unit 701, a second obtaining unit 702, and a training unit 703:
the first acquiring unit 701 is used for acquiring an artificial sample set generated by a learning object according to a user operation;
the second obtaining unit 702 is configured to obtain a machine sample set obtained by autonomously learning, in the learning object, a neural network model for the learning object;
the training unit 703 is configured to train the neural network model according to the artificial sample set and the machine sample set.
In one implementation, referring to fig. 7b, the apparatus 700 further comprises a pre-training unit 704:
the pre-training unit 704 is configured to pre-train the neural network model according to the artificial sample set;
the training unit 703 is specifically configured to:
training the pre-trained neural network model according to the artificial sample set and the machine sample set.
In one implementation, when the artificial sample set has been fully used for training, or when the preset progress of the learning object is completed through the neural network model, the pre-training unit 704 determines that the neural network model has completed pre-training.
In one implementation, the set of artificial samples includes a correspondence between actions performed in the learning object by user operations, environmental parameters of the learning object when the actions are performed, and feedback parameters of the learning object after the actions are performed;
the machine sample set comprises corresponding relations among actions implemented in the learning object through the neural network model, environment parameters of the learning object when the actions are implemented, and feedback parameters of the learning object after the actions are implemented.
In an implementation manner, the training unit 703 is specifically configured to:
respectively selecting samples from the artificial sample set and the machine sample set to obtain a training sample set for training the neural network model; the number of the artificial samples in the training sample set and the number of the machine samples meet a preset proportion;
and training the neural network model according to the training sample set.
In one implementation manner, the preset ratio between the number of the artificial samples and the number of the machine samples in the training sample set adopted for the Nth training of the neural network model is determined according to the training result of the (N-1) th training of the neural network model.
In one implementation, the proportion of the artificial samples in the training sample set used for the Nth training of the neural network model is smaller than the proportion of the artificial samples in the training sample set used for the (N-1) th training of the neural network model.
In an implementation manner, the second obtaining unit 702 is specifically configured to:
and in the process of training the neural network model, adding data obtained by autonomous learning in the learning object according to the neural network model into the machine sample set as machine samples.
In one implementation, for any one target sample in the artificial sample set or the machine sample set, the target sample includes feedback parameters of the learning object after the target action is performed, and the feedback parameters of the learning object after the target action is performed include reward parameters obtained by performing the target action and/or environmental parameters obtained in the learning object according to the target action.
The embodiment of the present application further provides an apparatus for training a neural network model, which is described below with reference to the accompanying drawings. Referring to fig. 8, an embodiment of the present application provides a device 800 for neural network model training. The device 800 may be a server whose configuration and performance may vary considerably; it may include one or more Central Processing Units (CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing applications 842 or data 844. The memory 832 and the storage medium 830 may provide transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 822 may be configured to communicate with the storage medium 830 and execute, on the device 800, the series of instruction operations in the storage medium 830.
The apparatus 800 for neural network model training may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input-output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 8.
The CPU 822 is configured to execute the following steps:
acquiring an artificial sample set generated by a learning object according to user operation;
acquiring a machine sample set obtained by autonomous learning of a neural network model for the learning object in the learning object;
training the neural network model according to the artificial sample set and the machine sample set.
Referring to fig. 9, an embodiment of the present application provides an apparatus 900 for neural network model training. The apparatus 900 may also be a terminal device, which can be any terminal device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS) terminal, or a vehicle-mounted computer; a mobile phone is taken as the example terminal device below:
fig. 9 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device according to an embodiment of the present disclosure. Referring to fig. 9, the handset includes: a Radio Frequency (RF) circuit 910, a memory 920, an input unit 930, a display unit 940, a sensor 950, an audio circuit 960, a wireless fidelity (WiFi) module 970, a processor 980, and a power supply 990. Those skilled in the art will appreciate that the handset configuration shown in fig. 9 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following specifically describes each component of the mobile phone with reference to fig. 9:
the RF circuit 910 may be used for receiving and transmitting signals during a message transmission or call, and in particular, for receiving downlink information of a base station and then processing the received downlink information to the processor 980; in addition, data for designing uplink is transmitted to the base station. In general, RF circuit 910 includes, but is not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 910 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), general Packet Radio Service (GPRS), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), long Term Evolution (LTE), email, short Message Service (SMS), and the like.
The memory 920 may be used to store software programs and modules, and the processor 980 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 920. The memory 920 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data or a phonebook), and the like. Further, the memory 920 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 930 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 930 may include a touch panel 931 and other input devices 932. The touch panel 931, also called a touch screen, can collect touch operations performed by the user on or near it (for example, operations performed on or near the touch panel 931 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 931 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 980, and can also receive and execute commands sent by the processor 980. The touch panel 931 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 931, the input unit 930 may include other input devices 932, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power switch key), a trackball, a mouse, and a joystick.
The display unit 940 may be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 940 may include a display panel 941, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 931 may cover the display panel 941; when the touch panel 931 detects a touch operation on or near it, the touch operation is transmitted to the processor 980 to determine the type of the touch event, and the processor 980 then provides a corresponding visual output on the display panel 941 according to the type of the touch event. Although in fig. 9 the touch panel 931 and the display panel 941 are shown as two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 931 and the display panel 941 may be integrated to implement both functions.
The handset may also include at least one sensor 950, such as a light sensor, a motion sensor, or other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 941 according to the brightness of ambient light, and a proximity sensor, which turns off the display panel 941 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as pedometers and tap detection). Other sensors that may also be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
WiFi is a short-range wireless transmission technology. Through the WiFi module 970, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access. Although fig. 9 shows the WiFi module 970, it is understood that it is not an essential component of the handset and may be omitted as needed without changing the essence of the invention.
The processor 980 is the control center of the mobile phone: it connects the various parts of the entire handset using various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 920 and invoking the data stored in the memory 920. Optionally, the processor 980 may include one or more processing units; preferably, the processor 980 may integrate an application processor, which mainly handles the operating system, the user interface, applications, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 980.
The handset also includes a power supply 990 (e.g., a battery) for supplying power to the various components. Preferably, the power supply is logically connected to the processor 980 via a power management system, which manages charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In this embodiment, the processor 980 included in the terminal device further has the following functions:
acquiring an artificial sample set generated by a learning object according to user operation;
acquiring a machine sample set obtained by autonomous learning of a neural network model for the learning object in the learning object;
training the neural network model according to the artificial sample set and the machine sample set.
The embodiment of the present application further provides a computer-readable storage medium for storing program code, the program code being configured to execute any one of the neural network model training methods described in the embodiments corresponding to fig. 1 to 6.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (11)
1. A neural network model training method, characterized in that, for a learning object requiring reinforcement learning, the method comprises the following steps:
acquiring an artificial sample set generated by the learning object according to user operation;
acquiring a machine sample set obtained by autonomous learning of a neural network model for the learning object in the learning object; the neural network model for the learning object is used for performing reinforcement learning on the learning object; the artificial sample set comprises a corresponding relation among actions implemented in the learning object through user operation, environment parameters of the learning object when the actions are implemented and feedback parameters of the learning object after the actions are implemented; the machine sample set comprises a corresponding relation among actions implemented in the learning object through the neural network model, environment parameters of the learning object when the actions are implemented and feedback parameters of the learning object after the actions are implemented;
training the neural network model according to the artificial sample set and the machine sample set, specifically comprising:
respectively selecting samples from the artificial sample set and the machine sample set to obtain a training sample set for training the neural network model, wherein the number of artificial samples and the number of machine samples in the training sample set satisfy a preset ratio;
and training the neural network model according to the training sample set, wherein the proportion of artificial samples in the training sample set adopted for the Nth training of the neural network model is smaller than the proportion of artificial samples in the training sample set adopted for the (N-1)-th training of the neural network model.
2. The method of claim 1, wherein prior to the training of the neural network model according to the training sample set, the method further comprises:
pre-training the neural network model according to the artificial sample set;
training the neural network model according to the training sample set, including:
and training the pre-trained neural network model according to the training sample set.
3. The method of claim 2, wherein the neural network model is determined to have completed pre-training when training on the artificial sample set is finished or when the neural network model completes a preset progress of the learning object.
4. The method of claim 1, wherein the preset ratio between the number of artificial samples and the number of machine samples in the training sample set used for the Nth training of the neural network model is determined according to the training result of the (N-1)-th training of the neural network model.
5. The method of claim 1, wherein the step of acquiring the machine sample set obtained by autonomous learning, in the learning object, of the neural network model for the learning object comprises:
and in the process of training the neural network model, adding data obtained by autonomous learning in the learning object according to the neural network model into the machine sample set as machine samples.
6. The method of claim 1, wherein the target samples comprise feedback parameters of the learning object after a target action is performed, and the feedback parameters of the learning object after the target action is performed comprise reward parameters obtained by performing the target action and/or environment parameters obtained in the learning object as a result of performing the target action.
7. A neural network model training apparatus, characterized by comprising, for a learning object requiring reinforcement learning, a first acquisition unit, a second acquisition unit, and a training unit, wherein:
the first acquisition unit is used for acquiring an artificial sample set generated by the learning object according to user operation;
the second acquisition unit is used for acquiring a machine sample set obtained by autonomous learning of a neural network model aiming at the learning object in the learning object; the neural network model for the learning object is used for performing reinforcement learning on the learning object; the artificial sample set comprises a corresponding relation among actions implemented in the learning object through user operation, environment parameters of the learning object when the actions are implemented and feedback parameters of the learning object after the actions are implemented; the machine sample set comprises a corresponding relation among actions implemented in the learning object through the neural network model, environment parameters of the learning object when the actions are implemented and feedback parameters of the learning object after the actions are implemented;
the training unit is configured to train the neural network model according to the artificial sample set and the machine sample set, and specifically includes:
respectively selecting samples from the artificial sample set and the machine sample set to obtain a training sample set for training the neural network model, wherein the number of artificial samples and the number of machine samples in the training sample set satisfy a preset ratio;
and training the neural network model according to the training sample set, wherein the proportion of artificial samples in the training sample set adopted for the Nth training of the neural network model is smaller than the proportion of artificial samples in the training sample set adopted for the (N-1)-th training of the neural network model.
8. The apparatus of claim 7, further comprising a pre-training unit, wherein:
the pre-training unit is used for pre-training the neural network model according to the artificial sample set;
the training unit is specifically configured to:
and training the neural network model which is pre-trained according to the artificial sample set and the machine sample set.
9. The apparatus of claim 8, wherein the neural network model is determined to have completed pre-training when training on the artificial sample set is finished or when the neural network model completes a preset progress of the learning object.
10. An apparatus for neural network model training, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the neural network model training method of any one of claims 1-6 according to instructions in the program code.
11. A computer-readable storage medium for storing program code for performing the neural network model training method of any one of claims 1-6.
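For readability only, a minimal sketch of the training schedule recited in claims 1-5 follows; every identifier (model.update, model.rollout) and the multiplicative decay of the artificial-sample ratio are assumptions made for illustration, since the claims do not fix a particular schedule:

```python
# Hedged sketch of claims 1-5; model.update, model.rollout and the 0.95
# decay factor are illustrative assumptions, not the patented method.
import random

def train(model, env, human_samples, iterations=100, batch_size=64):
    machine_samples = []

    # Claims 2-3: pre-train on the artificial sample set until it has
    # been trained through (or the model completes a preset progress of
    # the learning object).
    model.update(human_samples)

    ratio = 0.9  # initial share of artificial samples per batch
    for _ in range(iterations):
        # Claim 5: data gathered by the model's autonomous interaction
        # is added to the machine sample set while training runs.
        machine_samples.extend(model.rollout(env))

        # Claim 1: mix one batch at the current preset ratio.
        n_human = int(batch_size * ratio)
        batch = (random.sample(human_samples,
                               min(n_human, len(human_samples)))
                 + random.sample(machine_samples,
                                 min(batch_size - n_human,
                                     len(machine_samples))))
        model.update(batch)

        # Claim 1: the artificial share in round N is smaller than in
        # round N-1; claim 4 allows deriving it from the previous
        # training result, for which this fixed decay is a stand-in.
        ratio *= 0.95
```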
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811645093.8A CN109740738B (en) | 2018-12-29 | 2018-12-29 | Neural network model training method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109740738A CN109740738A (en) | 2019-05-10 |
CN109740738B true CN109740738B (en) | 2022-12-16 |
Family
ID=66362719
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811645093.8A Active CN109740738B (en) | 2018-12-29 | 2018-12-29 | Neural network model training method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109740738B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263939A (en) * | 2019-06-24 | 2019-09-20 | 腾讯科技(深圳)有限公司 | A kind of appraisal procedure, device, equipment and medium indicating learning model |
US11775860B2 (en) | 2019-10-15 | 2023-10-03 | UiPath, Inc. | Reinforcement learning in robotic process automation |
CN111142378A (en) * | 2020-01-07 | 2020-05-12 | 四川省桑瑞光辉标识系统股份有限公司 | Neural network optimization method of biped robot neural network controller |
CN111258909B (en) * | 2020-02-07 | 2024-03-15 | 中国信息安全测评中心 | Test sample generation method and device |
CN113010653B (en) * | 2021-03-16 | 2022-09-02 | 支付宝(杭州)信息技术有限公司 | Method and system for training and conversing conversation strategy model |
CN117664117B (en) * | 2024-01-31 | 2024-04-23 | 西安晟昕科技股份有限公司 | Drift data analysis and optimization compensation method for fiber optic gyroscope |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8386401B2 (en) * | 2008-09-10 | 2013-02-26 | Digital Infuzion, Inc. | Machine learning methods and systems for identifying patterns in data using a plurality of learning machines wherein the learning machine that optimizes a performance function is selected |
US8903167B2 (en) * | 2011-05-12 | 2014-12-02 | Microsoft Corporation | Synthesizing training samples for object recognition |
US10867255B2 (en) * | 2017-03-03 | 2020-12-15 | Hong Kong Applied Science and Technology Research Institute Company Limited | Efficient annotation of large sample group |
- 2018-12-29 (CN): application CN201811645093.8A, granted as patent CN109740738B (en), legal status Active
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102637172A (en) * | 2011-02-10 | 2012-08-15 | 北京百度网讯科技有限公司 | Webpage blocking marking method and system |
CN104966097A (en) * | 2015-06-12 | 2015-10-07 | 成都数联铭品科技有限公司 | Complex character recognition method based on deep learning |
CN107729908A (en) * | 2016-08-10 | 2018-02-23 | 阿里巴巴集团控股有限公司 | A kind of method for building up, the apparatus and system of machine learning classification model |
CN108205802A (en) * | 2016-12-23 | 2018-06-26 | 北京市商汤科技开发有限公司 | Deep neural network model training, image processing method and device and equipment |
CN107004141A (en) * | 2017-03-03 | 2017-08-01 | 香港应用科技研究院有限公司 | To the efficient mark of large sample group |
CN107688856A (en) * | 2017-07-24 | 2018-02-13 | 清华大学 | Indoor Robot scene active identification method based on deeply study |
CN107622276A (en) * | 2017-08-21 | 2018-01-23 | 北京精密机电控制设备研究所 | A kind of deep learning training method combined based on robot simulation and physics sampling |
CN107729854A (en) * | 2017-10-25 | 2018-02-23 | 南京阿凡达机器人科技有限公司 | A kind of gesture identification method of robot, system and robot |
CN108021931A (en) * | 2017-11-20 | 2018-05-11 | 阿里巴巴集团控股有限公司 | A kind of data sample label processing method and device |
CN108073902A (en) * | 2017-12-19 | 2018-05-25 | 深圳先进技术研究院 | Video summary method, apparatus and terminal device based on deep learning |
CN108171280A (en) * | 2018-01-31 | 2018-06-15 | 国信优易数据有限公司 | A kind of grader construction method and the method for prediction classification |
CN108491389A (en) * | 2018-03-23 | 2018-09-04 | 杭州朗和科技有限公司 | Click bait title language material identification model training method and device |
CN108563204A (en) * | 2018-04-11 | 2018-09-21 | 北京木业邦科技有限公司 | Control method, device, electronic equipment and computer readable storage medium |
CN108564030A (en) * | 2018-04-12 | 2018-09-21 | 广州飒特红外股份有限公司 | Classifier training method and apparatus towards vehicle-mounted thermal imaging pedestrian detection |
CN109034397A (en) * | 2018-08-10 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Model training method, device, computer equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
Online Service Reputation Measurement Based on Multi-dimensional Evaluation Information (in Chinese); Yang Tidong et al.; Journal of Chinese Computer Systems; 2018-12-11 (No. 12); pp. 67-73 *
Enhanced Deep Deterministic Policy Gradient Algorithm (in Chinese); Chen Jianping et al.; Journal on Communications; 2018-11-25 (No. 11); pp. 110-119 *
Also Published As
Publication number | Publication date |
---|---|
CN109740738A (en) | 2019-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109740738B (en) | Neural network model training method, device, equipment and medium | |
US10354393B2 (en) | Method and device for determining motion trajectory of target subject, and storage medium | |
CN110163045B (en) | Gesture recognition method, device and equipment | |
CN107273011A (en) | Application program fast switch over method and mobile terminal | |
CN110163367B (en) | Terminal deployment method and device | |
CN113018848B (en) | Game picture display method, related device, equipment and storage medium | |
CN103472756A (en) | Artificial intelligence achieving method, server and equipment | |
CN109271038B (en) | Candidate word recommendation method, terminal and computer readable storage medium | |
CN111352844A (en) | Test method and related device | |
KR101278592B1 (en) | Method for virtual golf simulation providing match-up game and apparatus therefor | |
CN108958629B (en) | Split screen quitting method and device, storage medium and electronic equipment | |
CN109173250B (en) | Multi-role control method, computer storage medium and terminal | |
TW201515682A (en) | Methods and terminal for data acquisition | |
CN109902876B (en) | Method and device for determining movement direction of intelligent equipment and path planning system | |
CN111598169A (en) | Model training method, game testing method, simulation operation method and simulation operation device | |
CN110841295B (en) | Data processing method based on artificial intelligence and related device | |
CN109886408A (en) | A kind of deep learning method and device | |
CN105183172A (en) | Information processing method and electronic equipment | |
CN112774194B (en) | Virtual object interaction method and related device | |
CN109173249B (en) | Virtual pass method, computer storage medium and terminal | |
CN107562303B (en) | Method and device for controlling element motion in display interface | |
CN109545321A (en) | A kind of policy recommendation method and device | |
CN110597973A (en) | Man-machine conversation method, device, terminal equipment and readable storage medium | |
CN113453066A (en) | Playing time determining method and related equipment | |
CN108815850B (en) | Method and client for controlling path finding of analog object |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||