CN117725979B - Model training method and device, electronic equipment and computer readable storage medium - Google Patents
Model training method and device, electronic equipment and computer readable storage medium
- Publication number
- CN117725979B (application CN202311266738.8A)
- Authority
- CN
- China
- Prior art keywords
- model
- layer
- trained
- result
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The application discloses a model training method and apparatus, an electronic device, and a computer readable storage medium. The method comprises: acquiring training data, the training data being used to update parameters of a model to be trained; processing the training data through the model to be trained and determining a target gradient of a first layer of the model to be trained; obtaining an update gradient of the first layer based on a preset target weight and the target gradient, the target weight characterizing the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained; and, in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer based on the update gradient to obtain a target model.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a model training method and apparatus, an electronic device, and a computer readable storage medium.
Background
By training a model, the model can acquire the ability to process data. However, to achieve a good training effect, the model generally needs to be trained sufficiently, and sufficient training takes a long time, resulting in a long training time and low training efficiency. Therefore, shortening the training time and improving the training efficiency is of great significance.
Disclosure of Invention
The application provides a model training method and device, electronic equipment and a computer readable storage medium.
In a first aspect, a model training method is provided, the method comprising:
acquiring training data, wherein the training data is used for updating parameters of a model to be trained;
processing the training data through the model to be trained, and determining a target gradient of a first layer of the model to be trained;
obtaining an update gradient of the first layer based on a preset target weight and the target gradient, wherein the target weight characterizes the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained; and
in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer based on the update gradient to obtain a target model.
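As a minimal sketch, the four steps above might be combined as follows. The function and parameter names (`train_first_layer`, `compute_target_gradient`, the learning rate `lr`) are illustrative assumptions, and the product form of the update gradient is only one of the implementations described later in the disclosure.

```python
# Minimal sketch of the claimed method, assuming the update gradient is the
# product of the preset target weight and the first layer's target gradient.
# All names here are illustrative, not taken from the patent.

def train_first_layer(params, target_weight, training_data, compute_target_gradient, lr=0.1):
    """Update the first layer's parameters with a weight-scaled gradient."""
    # Step 102: process the training data and determine the target gradient.
    target_gradient = compute_target_gradient(params, training_data)
    # Step 103: update gradient = target weight * target gradient.
    update_gradient = [target_weight * g for g in target_gradient]
    # Step 104: gradient-descent update of the first layer's parameters.
    return [p - lr * g for p, g in zip(params, update_gradient)]
```

For example, with a toy gradient function `lambda p, d: p` (the gradient of 0.5·‖p‖²), a target weight of 2.0 doubles the step taken on the first layer's parameters.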
In combination with any one of the embodiments of the present application, in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer based on the update gradient to obtain a target model, including:
and in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer and the target weight based on the update gradient to obtain a target model.
In combination with any one of the embodiments of the present application, the model to be trained includes a second layer, and the output of the second layer is the result output by the model to be trained;
the processing the training data through the model to be trained and determining the target gradient of the first layer includes:
processing the training data through the model to be trained, wherein the first layer outputs a first result, and the second layer outputs a second result;
concatenating the first result and the second result to obtain a concatenated result;
and determining the target gradient based on the difference between the concatenated result and the label of the training data.
In combination with any one of the embodiments of the present application, the concatenating the first result and the second result to obtain a concatenated result includes:
concatenating the first result and the second result to obtain an intermediate result;
and determining the product of the intermediate result and the target weight as the concatenated result.
In combination with any one of the embodiments of the present application, the concatenating the first result and the second result to obtain an intermediate result includes:
encoding the first result to obtain an encoded first result, wherein the dimension of the encoded first result is the same as the dimension of the second result;
and concatenating the encoded first result and the second result to obtain the intermediate result.
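A minimal sketch of the encode-then-concatenate step described above. The patent does not fix a particular encoder, so the plain linear projection used here (and the list-based matrix representation) is an assumption for illustration.

```python
# Sketch of the encode-then-concatenate step, assuming the "encoding" is a
# plain linear projection that maps the first result to the dimension of the
# second result. The projection matrix and all names are illustrative.

def encode(first_result, projection):
    # Project first_result (length m) to length n using an m x n matrix,
    # so its dimension matches that of the second result.
    m, n = len(projection), len(projection[0])
    return [sum(first_result[i] * projection[i][j] for i in range(m)) for j in range(n)]

def concat_results(first_result, second_result, projection):
    encoded = encode(first_result, projection)  # encoded first result
    return encoded + second_result              # the intermediate result
```

For instance, a length-2 first result can be padded into a length-3 space with the projection `[[1, 0, 0], [0, 1, 0]]` and then concatenated with a length-3 second result.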
In combination with any one of the embodiments of the present application, the determining the target gradient based on the difference between the concatenated result and the label of the training data includes:
determining a loss of the first layer based on the difference between the concatenated result and the label of the training data;
and determining the target gradient based on the loss of the first layer.
In combination with any of the embodiments of the present application, the determining the target gradient based on the loss of the first layer includes:
obtaining the target gradient by calculating the partial derivative of the loss of the first layer with respect to the parameters in the first layer.
In combination with any one of the embodiments of the present application, the obtaining the update gradient of the first layer based on the preset target weight and the target gradient includes:
determining the product of the target weight and the target gradient as the update gradient.
In combination with any one of the embodiments of the present application, the training data is training text including a mask, and the result output by the model to be trained includes a predicted result of the mask.
In a second aspect, there is provided a model training apparatus, the apparatus comprising:
an acquisition unit, configured to acquire training data, wherein the training data is used for updating parameters of a model to be trained;
a determining unit, configured to process the training data through the model to be trained and determine a target gradient of a first layer of the model to be trained;
a processing unit, configured to obtain an update gradient of the first layer based on a preset target weight and the target gradient, wherein the target weight characterizes the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained;
and an updating unit, configured to update the parameters of the first layer based on the update gradient in the process of updating the parameters of the model to be trained based on the training data, so as to obtain a target model.
In combination with any one of the embodiments of the present application, the updating unit is configured to:
and in the process of updating the parameters of the model to be trained based on the training data, update the parameters of the first layer and the target weight based on the update gradient to obtain a target model.
In combination with any one of the embodiments of the present application, the model to be trained includes a second layer, and the output of the second layer is the result output by the model to be trained;
the determining unit is configured to:
process the training data through the model to be trained, wherein the first layer outputs a first result, and the second layer outputs a second result;
concatenate the first result and the second result to obtain a concatenated result;
and determine the target gradient based on the difference between the concatenated result and the label of the training data.
In combination with any one of the embodiments of the present application, the determining unit is configured to:
concatenate the first result and the second result to obtain an intermediate result;
and determine the product of the intermediate result and the target weight as the concatenated result.
In combination with any one of the embodiments of the present application, the determining unit is configured to:
encode the first result to obtain an encoded first result, wherein the dimension of the encoded first result is the same as the dimension of the second result;
and concatenate the encoded first result and the second result to obtain the intermediate result.
In combination with any one of the embodiments of the present application, the determining unit is configured to:
determine a loss of the first layer based on the difference between the concatenated result and the label of the training data;
and determine the target gradient based on the loss of the first layer.
In combination with any one of the embodiments of the present application, the determining unit is configured to:
obtain the target gradient by calculating the partial derivative of the loss of the first layer with respect to the parameters in the first layer.
In combination with any one of the embodiments of the present application, the processing unit is configured to:
determine the product of the target weight and the target gradient as the update gradient.
In combination with any one of the embodiments of the present application, the training data is training text including a mask, and the result output by the model to be trained includes a predicted result of the mask.
In a third aspect, an electronic device is provided, comprising: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform a method as described in the first aspect and any one of its possible implementations.
In a fourth aspect, there is provided another electronic device comprising: a processor, a transmitting means, an input means, an output means and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the first aspect and any implementation thereof as described above.
In a fifth aspect, there is provided a computer readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the first aspect and any implementation thereof as described above.
In a sixth aspect, there is provided a computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the first aspect and any embodiments thereof.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
In the present application, after acquiring the training data, the model training device processes the training data with the model to be trained to determine the target gradient of the first layer of the model to be trained. Because the preset target weight of the first layer characterizes the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained, the model training device optimizes the target gradient based on the target weight to obtain the update gradient of the first layer, and then, in the process of updating the parameters of the model to be trained, updates the parameters of the first layer based on the update gradient to obtain the target model. In this way, the model to be trained completes training faster and the accuracy of the result output by the target model is improved; that is, both the training efficiency and the training effect of the model to be trained are improved.
According to the embodiment of the application, the gradient of each layer in the model to be trained is optimized, so that the direction in which the parameters of each layer are updated based on their gradients is more accurate, which shortens the time consumed to update the parameters of each layer and therefore the time consumed for the model to be trained to converge. Specifically, when the model to be trained is trained with the method of the embodiment of the application, the reduction in convergence time is the sum of the reductions in the time consumed to update the parameters of each layer. A large-scale model has many layers, or many parameters, so when the model to be trained is a large-scale model, the convergence time can be shortened significantly and the training efficiency improved.
Drawings
In order to more clearly describe the embodiments of the present application or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present application or the background art.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a model to be trained according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a model training device according to an embodiment of the present application;
Fig. 4 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
By training a model, the model can learn the ability to perform a task and can then be used to perform that task. For example, the task may be determining the content corresponding to a mask in a document, determining the relevance between one document and another, determining the relevance between a search term and a document, or classifying an image. The ability of the model to perform the task depends on the training effect of the model; that is, the better the training effect, the stronger the ability of the trained model to perform the task.
In order to improve the training effect, the model needs to be trained sufficiently, but sufficient training takes a long time, resulting in a long training time and low training efficiency. Therefore, shortening the training time and improving the training efficiency is of great significance.
The execution subject of the embodiment of the application is a model training device, where the model training device may be any electronic device capable of executing the technical solutions disclosed in the method embodiments of the application. Optionally, the model training device may be one of the following: a computer, a server.
It should be understood that the method embodiments of the present application may also be implemented by means of a processor executing computer program code. Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. Referring to fig. 1, fig. 1 is a flow chart of a model training method according to an embodiment of the application.
101. Training data is acquired.
In the embodiment of the application, the training data is data for training the model to be trained. In one possible implementation, the model to be trained is a large-scale model, where a large-scale model satisfies at least one of the following conditions: the number of parameters exceeds a parameter threshold, and the number of modules exceeds a module threshold. For example, if the parameter threshold is 1,000,000, then a model with more than 1,000,000 parameters is a large-scale model. For another example, if the module threshold is 10,000, then a model with more than 10,000 modules is a large-scale model. For another example, if the parameter threshold is 1,000,000 and the module threshold is 10,000, then a model whose number of parameters exceeds 1,000,000 and whose number of modules exceeds 10,000 is a large-scale model.
A module includes a layer or a neuron in a neural network. For example, if a neural network includes convolution layers, pooling layers, normalization layers, linearization layers, fully connected layers and activation layers, then the neural network is a large-scale model when the total number of these layers exceeds 10,000. As another example, if a neural network includes neurons, then the neural network is a large-scale model when the total number of neurons exceeds 10,000. Optionally, the model to be trained is one of the following: a chat model (Chat Generative Pre-trained Transformer, ChatGPT), a residual network (ResNet), a graph convolutional network (GCN), a graph neural network (GNN), a large language model (LLM), a sequence-to-sequence model (Seq2Seq), a recurrent neural network (RNN), or a generative adversarial network (GAN).
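The large-scale criterion above can be sketched as a simple check. The thresholds reuse the example values (1,000,000 parameters, 10,000 modules), and the "at least one condition" reading is taken as a logical OR; both choices are assumptions for illustration.

```python
# Sketch of the "large-scale model" test described above, using the example
# thresholds from the text. The OR semantics reflect the "at least one of the
# following conditions" wording; this interpretation is ours.

PARAM_THRESHOLD = 1_000_000   # example parameter threshold
MODULE_THRESHOLD = 10_000     # example module threshold

def is_large_scale(num_params, num_modules):
    # A model is large-scale if it exceeds either threshold.
    return num_params > PARAM_THRESHOLD or num_modules > MODULE_THRESHOLD
```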
In one possible implementation, where the model to be trained is a text model, the training data includes tagged text data. For example, where the task to be performed by the model to be trained includes predicting whether the user is interested in the document, the training data includes the document with tags, where the tags include whether the user is interested in the document or not.
In another possible implementation, where the model to be trained is an image model, the training data comprises image data with labels, e.g. where the task to be performed by the model to be trained comprises classifying the image, the training data comprises an image with labels, wherein the labels comprise the category of the image.
In a further possible implementation, where the model to be trained is an audio model, the training data comprises audio data with tags, e.g. where the task to be performed by the model to be trained comprises classifying the audio, the training data comprises audio with tags, wherein the tags comprise categories of audio.
In one implementation of acquiring training data, the model training device receives training data input by a user through an input component. The input component includes: a keyboard, a mouse, a touch screen, a touch pad, and an audio input device.
In another implementation manner of acquiring training data, the model training device receives training data sent by the terminal to acquire training data. Alternatively, the terminal may be any of the following: cell phone, computer, tablet computer, server, wearable equipment.
102. And processing the training data through the model to be trained, and determining the target gradient of the first layer of the model to be trained.
In the embodiment of the application, the model to be trained can be any model needing training. In one possible implementation, the model to be trained is a text model. In another possible implementation, the model to be trained is an image model. In yet another possible implementation, the model to be trained is an audio model.
In the embodiment of the application, the model to be trained comprises a first layer, wherein the first layer is a model structure in the model to be trained. In one possible implementation, the model to be trained is a neural network, and the first layer is a network layer (e.g., a convolutional layer) in the neural network. It should be appreciated that where the model to be trained includes at least one layer, the first layer is any one of the models to be trained.
The training process of the model to be trained includes forward propagation and backward propagation. The result output by the model to be trained is determined through forward propagation; during backward propagation, the gradient of each layer in the model to be trained is obtained by performing backward gradient calculation based on the result output by the model to be trained and the label of the training data, and the gradient of each layer is the basis for updating the parameters of that layer during backward propagation. The target gradient is the gradient of the first layer during the backward propagation of the model to be trained.
In one possible implementation manner, in the process of training the model to be trained by using training data, the model to be trained can obtain a result output by the model to be trained by processing the training data, then the loss of the model to be trained can be determined based on the result output by the model to be trained and the label of the training data, and then the gradient of the first layer (namely, the target gradient) can be determined based on the loss.
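As a toy illustration of determining the target gradient from the loss, consider a single linear "layer" y = w·x with a squared loss standing in for the full model; this one-parameter simplification is ours, not the patent's.

```python
# Toy illustration of step 102: forward pass, loss from the label, and the
# layer's gradient as the partial derivative of the loss with respect to that
# layer's parameter. A single linear "layer" y = w * x with squared loss
# stands in for the full model to be trained; this simplification is ours.

def target_gradient(w, x, label):
    y = w * x                    # forward propagation: model output
    loss = (y - label) ** 2      # loss from output vs. label
    grad = 2 * (y - label) * x   # d(loss)/dw: the layer's (target) gradient
    return loss, grad
```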
103. And obtaining the updated gradient of the first layer based on the preset target weight and the target gradient.
In the embodiment of the application, the target weight is preset. The target weight characterizes the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained. Specifically, the larger the target weight, the greater the improvement that the output of the first layer brings to the accuracy of the result output by the model to be trained; in other words, the higher the importance of the first layer in the model to be trained.
It should be understood that every layer in the model to be trained participates in the process of processing the input data and outputting the result; that is, the output of each layer affects the result output by the model to be trained, and the outputs of different layers affect that result to different degrees. In other words, the outputs of different layers improve the accuracy of the result output by the model to be trained to different degrees, so different layers are of different importance for improving that accuracy.
In one possible implementation, the model to be trained includes a second layer different from the first layer, and the weight of the second layer characterizes the degree to which the output of the second layer improves the accuracy of the result output by the model to be trained. In the case that the target weight is greater than the weight of the second layer, the output of the first layer improves the accuracy of the result output by the model to be trained more than the output of the second layer does; that is, the first layer is more important than the second layer.
During the training of the model to be trained, the parameters of the first layer can be updated based on the target gradient; updating the parameters of the first layer changes the output of the first layer, and the output of the first layer affects the accuracy of the result output by the model to be trained. Therefore, the gradient of the first layer can be optimized according to the importance of the first layer, so that the output of the first layer improves the result output by the model to be trained to a greater extent. Because the target weight characterizes the importance of the first layer, the model training device optimizes the target gradient based on the target weight to obtain the update gradient of the first layer.
In one possible implementation, the model training means calculates the product of the target weight and the target gradient to obtain the updated gradient of the first layer.
In another possible implementation manner, the model training device calculates a product of the target weight and the target gradient to obtain an intermediate value, and takes the sum of the intermediate value and a preset value as the update gradient of the first layer.
In a further possible implementation, the model training means calculate the product of the square of the target weight and the target gradient to obtain the updated gradient of the first layer.
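The three implementations above can be written side by side; `preset_value` stands for the unspecified preset value of the second implementation, and all names are illustrative.

```python
# The three implementations of step 103 sketched side by side. The patent does
# not specify the preset value in the second variant, so `preset_value` here
# is a placeholder.

def update_gradient_product(weight, grad):
    return weight * grad                 # variant 1: weight * gradient

def update_gradient_offset(weight, grad, preset_value):
    return weight * grad + preset_value  # variant 2: weight * gradient + preset value

def update_gradient_squared(weight, grad):
    return weight ** 2 * grad            # variant 3: weight^2 * gradient
```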
104. And in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer based on the updating gradient to obtain a target model.
The training data is used for updating the parameters of the model to be trained, so training the model to be trained with the training data updates its parameters. Specifically, the training data is input into the model to be trained, the model to be trained processes the training data and outputs a processing result, and the loss of the model to be trained is determined from the difference between the output processing result and the label of the training data. The model to be trained is then back-propagated according to this loss. During the back propagation of the model to be trained, the parameters of the model structures in the model to be trained are updated based on the back-propagated gradients.
In the process of back propagation, after the gradient of each layer in the model to be trained is calculated, the parameters of each layer may be updated according to the gradient of each layer as described in step 102. Therefore, in the process of updating the parameters of the model to be trained based on the training data, the model training device updates the parameters of the first layer based on the update gradient, so that the parameters of the first layer can be updated towards the direction of improving the accuracy of the result output by the model to be trained. It should be understood that, in the process of updating the parameters of the model to be trained, the model structures in the model to be trained are updated based on the gradient, but the updating of the parameters of different model structures is different due to the different gradients of the different model structures.
In one possible implementation, the training data is training text that includes a mask. The training text may be text describing any content; for example, the content of the training text may be a red car, or a basketball game. The mask in the training text may be generated by: masking, replacing, or extracting. For example, the training text is: a basketball game held at a school. Masking the word "basketball" in the training text generates a mask in the training text; replacing "basketball" in the training text with a predetermined character also generates a mask in the training text; and extracting "basketball" from the training text also generates a mask in the training text.
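A minimal sketch of the ways a mask can be generated in the training text; the `[MASK]` token and the whitespace tokenization are assumptions for illustration, not specified by the patent.

```python
# Sketch of generating a mask in training text. The "[MASK]" token stands in
# for the predetermined character mentioned above, and words are split on
# whitespace; both choices are illustrative assumptions.

def mask_word(text, word, mask_token="[MASK]"):
    # masking / replacing: substitute the word with a mask token
    return text.replace(word, mask_token)

def extract_word(text, word):
    # extracting: remove the word from the text entirely
    return " ".join(t for t in text.split() if t != word)
```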
In the case that the training data is training text including a mask, the result output by the model to be trained includes a prediction of the masked content, where the prediction is the content of the mask as determined by the model to be trained; that is, training the model to be trained with such data gives it the ability to predict the content of a mask in text. Updating the parameters of the first layer based on the update gradient during training yields the target model, improves the training effect of the model to be trained, and thereby improves the accuracy with which the target model predicts the content of a mask in text.
In another possible implementation, the training data is a training image with a label, where the label is the category of the training image; for example, if the training image contains an apple, the label may be "apple", and if it contains a car, the label may be "car". Training the model to be trained with the training image gives it the ability to classify images. Specifically, after the training image is input into the model to be trained, the model to be trained processes the training image and outputs a prediction of the training image's category. The loss of the model to be trained is determined based on the difference between the prediction and the label of the training image. The partial derivatives of this loss with respect to each parameter in the model to be trained are then computed via a back propagation algorithm to obtain the back-propagation gradient of each layer in the model to be trained. The parameters of each layer are updated based on that layer's back-propagation gradient, where the back-propagation gradient of the first layer is the update gradient, so the parameters of the first layer are updated based on the update gradient. When the loss of the model to be trained converges, training is complete and the target model, which has the ability to classify images, is obtained.
In yet another possible implementation, the training data is a training word pair with a label, where the word pair includes two words and the label is the matching degree of those two words. Training the model to be trained with word pairs gives it the ability to determine the matching degree of two words. Specifically, after a training word pair is input into the model to be trained, the model to be trained processes the pair and extracts the word features of its two words, where a word feature carries the semantic information of a word. The semantic relevance of the two words is then determined from their word features, and the matching degree of the two words is determined from the semantic relevance, where the matching degree is positively correlated with the semantic relevance. The model to be trained finally outputs a prediction of the matching degree of the two words in the training word pair. The loss of the model to be trained is determined based on the difference between the prediction and the label of the training word pair. The partial derivatives of this loss with respect to each parameter in the model to be trained are then computed via a back propagation algorithm to obtain the back-propagation gradient of each layer in the model to be trained. The parameters of each layer are updated based on that layer's back-propagation gradient, where the back-propagation gradient of the first layer is the update gradient, so the parameters of the first layer are updated based on the update gradient.
When the loss of the model to be trained converges, training is complete and the target model, which has the ability to determine the matching degree of two words, is obtained.
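The update rule shared by the implementations above can be sketched numerically: the first layer's back-propagation gradient is scaled by the preset target weight before the parameter step. The two-parameter linear model, the weight value, the learning rate, and the data below are illustrative assumptions, not the patent's values.

```python
w1, w2 = 0.5, 0.5          # parameters of layer 1 and layer 2
target_weight = 2.0        # preset target weight of the first layer
lr = 0.1                   # learning rate
x, label = 1.0, 1.0        # one training sample and its label

# forward pass: out = w2 * (w1 * x); squared-error loss
hidden = w1 * x
out = w2 * hidden
loss = (out - label) ** 2

# back propagation: partial derivatives of the loss w.r.t. each parameter
g2 = 2 * (out - label) * hidden      # back-propagation gradient of layer 2
g1 = 2 * (out - label) * w2 * x      # target gradient of the first layer

# update gradient of the first layer = target weight * target gradient
g1_hat = target_weight * g1

# gradient-descent step using the (optimized) gradients
w1 -= lr * g1_hat
w2 -= lr * g2
```

With target_weight above 1, the first layer's parameters move further per step than with the raw gradient, which is the mechanism the text credits with faster convergence.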
In the embodiment of the present application, after acquiring the training data, the model training apparatus processes the training data with the model to be trained to determine the target gradient of the first layer in the model to be trained. The preset target weight of the first layer characterizes the degree to which the first layer improves the accuracy of the result output by the model to be trained, so the model training apparatus optimizes the target gradient based on the target weight to obtain the update gradient of the first layer. Updating the parameters of the first layer based on the update gradient while updating the parameters of the model to be trained then yields the target model, allowing the model to be trained to complete training faster and improving the accuracy of the result output by the target model; that is, both the training efficiency and the training effect of the model to be trained are improved.
It should be understood that, in the embodiment of the present application, the first layer and the target weight are named only for conciseness of description; it should not be understood that the model to be trained includes only one layer, or that only one layer has a weight. In practical application, the model to be trained may include at least one layer, each layer has a corresponding weight, and the weight of each layer is used to optimize the gradient of that layer; in the process of updating the parameters of the model to be trained, the parameters of every layer can be updated based on the optimized gradients of all layers. In this way, the gradient of each layer can be optimized according to the importance of that layer in the model to be trained to obtain its update gradient, and the parameters of each layer are then updated based on that update gradient. Compared with directly updating the parameters of each layer based on the raw gradient determined from the loss of the model to be trained, this allows the parameters of each layer to satisfy the requirement more quickly, where the requirement is convergence of the loss of the model to be trained. The training time of the model to be trained can therefore be shortened, and both the training efficiency and the training effect of the model to be trained can be improved.
According to the embodiment of the present application, optimizing the gradient of each layer in the model to be trained makes the direction in which each layer's parameters are updated based on its gradient more accurate, shortening the time consumed by each layer's parameter updates and hence the time consumed for the model to be trained to converge. Specifically, when the model to be trained is trained by the method of the embodiment of the present application, the reduction in the time consumed for convergence is the sum of the reductions in the time consumed by the parameter updates of every layer. A large-scale model has many layers, or many parameters; therefore, when the model to be trained is a large-scale model, the time consumed for convergence can be significantly shortened and training efficiency significantly improved.
In one possible application scenario, the training data is training text and the first layer of the model to be trained is a feature extraction layer. After the training text is input into the model to be trained, the first layer performs feature extraction on the training text to obtain its text features. The model to be trained obtains a processing result of the training text based on the text features, and this processing result is the result output by the model to be trained; for example, when the training text includes a mask, the processing result is obtained by predicting the content corresponding to the mask in the training text. The loss of the model to be trained is determined based on the difference between the processing result and the label of the mask, where the label of the mask is the ground truth (GT). The target gradient of the first layer is determined based on this loss, and the update gradient of the first layer is obtained based on the target weight and the target gradient. In this way, in the process of updating the parameters of the model to be trained, the parameters of the first layer are updated based on the update gradient, and the target model is obtained.
In this application scenario, after the loss of the model to be trained is determined, the target gradient of the first layer is first determined based on the loss, and the target gradient is then optimized based on the target weight to obtain the update gradient of the first layer. Updating the parameters of the first layer based on the update gradient allows the result output by the first layer to be optimized in the direction that improves the accuracy of the processing result the model to be trained outputs for the training text, so that the first layer's output contributes to a more accurate processing result. Both the training efficiency and the training effect of the model to be trained are thereby improved.
In another possible application scenario, the training data is a training image and the first layer of the model to be trained is a feature extraction layer. After the training image is input into the model to be trained, the first layer performs feature extraction on the training image to obtain its image features. Based on the image features, the model to be trained obtains a prediction of the training image's category. The loss of the model to be trained is determined based on the difference between the prediction and the label of the training image, where the label of the training image is the GT. The target gradient of the first layer is determined based on this loss, and the update gradient of the first layer is obtained based on the target weight and the target gradient. In this way, in the process of updating the parameters of the model to be trained, the parameters of the first layer are updated based on the update gradient, and the target model is obtained.
In this application scenario, after the loss of the model to be trained is determined, the target gradient of the first layer is determined based on that loss, and the target gradient is optimized based on the target weight to obtain the update gradient of the first layer. Updating the parameters of the first layer based on the update gradient allows the image features extracted by the first layer to be optimized in the direction that improves the accuracy of the prediction output by the model to be trained, so that both the training efficiency and the training effect of the model to be trained are improved.
As an alternative embodiment, the model training apparatus performs the following steps in performing step 104:
201. And in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer and the target weight based on the updating gradient to obtain a target model.
As described above, the training data is used to update the parameters of the model to be trained, so training the model to be trained with the training data updates its parameters, including the parameters of the first layer. In this embodiment, in the process of updating the parameters of the model to be trained based on the training data, the model training apparatus updates not only the parameters of the first layer based on the update gradient but also the target weight, thereby improving the accuracy with which the target weight characterizes the importance of the first layer.
As an alternative embodiment, the model to be trained includes a second layer, where the output of the second layer is the result output by the model to be trained; that is, the second layer is the last layer in the model to be trained. It should be understood that the output of the first layer may be the input of the second layer; for example, the first layer is the input layer of the model to be trained (i.e., the input of the model to be trained is the input of the first layer), the output of the first layer is the input of the second layer, and the output of the second layer is the result output by the model to be trained. Alternatively, other layers may lie between the first layer and the second layer, in which case the output of the first layer is not the input of the second layer; for example, the first layer is the input layer of the model to be trained, the output of the first layer is the input of a third layer, the output of the third layer is the input of the second layer, and the output of the second layer is the result output by the model to be trained.
In this embodiment, the model training apparatus performs the following steps in performing step 102:
301. and processing the training data through the model to be trained, wherein the first layer outputs a first result, and the second layer outputs a second result.
In the process of processing training data by the model to be trained, each layer in the model to be trained can output a result by processing input, wherein the result output by the first layer is a first result, and the result output by the second layer is a second result. For example, the first layer is an input layer of the model to be trained, the output of the first layer is an input of the second layer, and the output of the second layer is a result of the output of the model to be trained. The input of the model to be trained is data a, the data a is processed by a first layer to output data b, namely the data b is a first result, and the data b is processed by a second layer to output data c, namely the second result is data c.
302. And splicing (concat) the first result and the second result to obtain a spliced result.
303. And determining the target gradient based on the difference between the splicing result and the label of the training data.
In the embodiment of the present application, the label of the training data is the GT, and the accuracy of an output result can be determined by supervising that result with the label of the training data. Since the splicing result includes the first result and the second result, the accuracy of the splicing result can be determined by supervising the splicing result with the label of the training data; in other words, the degree to which the first result output by the first layer improves the accuracy of the result output by the model to be trained can be determined. The model training apparatus therefore determines the target gradient of the first layer based on the difference between the splicing result and the label of the training data.
In one possible implementation, the model training apparatus determines the loss of the first layer based on the difference between the splicing result and the label of the training data. Optionally, the model training apparatus determines the loss of the model to be trained based on the difference between the splicing result and the label of the training data, where the loss of the model to be trained is positively correlated with the difference, and then determines the loss of the first layer based on the loss of the model to be trained. After determining the loss of the first layer, the model training apparatus may determine the target gradient based on that loss. Optionally, the model training apparatus obtains the target gradient by computing the partial derivatives of the loss of the first layer with respect to the parameters in the first layer. Optionally, the model training apparatus computes these partial derivatives according to a back propagation (BP) algorithm to obtain the target gradient.
In such an embodiment, the model training apparatus processes the training data with the model to be trained so that the first layer outputs the first result and the second layer outputs the second result. Since the output of the second layer is the result output by the model to be trained, the second result is that result. Splicing the first result and the second result into the splicing result therefore adds the output of the first layer to the result output by the model to be trained, so that supervising the splicing result with the label of the training data can determine the degree to which the first result output by the first layer improves the accuracy of the result output by the model to be trained. The model training apparatus thus determines the target gradient based on the difference between the splicing result and the label of the training data, improving the accuracy of the target gradient.
It should be understood that, in practical application, the model training apparatus processes the training data with the model to be trained and determines the output results of all non-output layers in the model to be trained, where a non-output layer is any layer other than the second layer. The result output by each non-output layer is spliced with the second result output by the second layer to obtain that non-output layer's superposition result. The back-propagation gradient of each non-output layer is then determined based on the difference between its superposition result and the label of the training data. Optionally, the model training apparatus determines the loss of each non-output layer based on the difference between that layer's superposition result and the label of the training data, and obtains each non-output layer's back-propagation gradient by computing the partial derivatives of its loss with respect to its parameters. Optionally, the model training apparatus computes these partial derivatives according to the BP algorithm to obtain the back-propagation gradient of each non-output layer.
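Steps 301 to 303 can be sketched as follows. The array values, the concatenation order, and the squared-error loss are illustrative assumptions; the patent does not fix a particular loss function here.

```python
import numpy as np

first_result = np.array([0.2, 0.4])      # first result, output by the first layer
second_result = np.array([0.1, 0.9])     # second result, output by the second (last) layer
label = np.array([0.0, 1.0, 0.0, 1.0])   # label of the training data, supervising the splice

# step 302: splice (concat) the first result and the second result
splice = np.concatenate([first_result, second_result])

# step 303: determine a gradient signal from the difference between
# the splicing result and the label of the training data
diff = splice - label
loss = float((diff ** 2).mean())         # squared-error loss (assumption)
target_gradient = 2 * diff / diff.size   # gradient of the loss w.r.t. the splice
```

In a real network the target_gradient would be propagated further back into the first layer's parameters rather than stopping at the splice.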
As an alternative embodiment, the model training apparatus performs the following steps in the step of performing step 302:
401. And splicing the first result and the second result to obtain an intermediate result.
It should be appreciated that in step 302 the first result and the second result are spliced, but in this step the model training apparatus splices the first result and the second result to obtain not the splicing result but an intermediate result.
402. And determining the product of the intermediate result and the target weight as the splicing result.
Since the intermediate result includes the first result output by the first layer, and the target weight characterizes the degree to which the first layer's output improves the accuracy of the result output by the model to be trained, the intermediate result can be optimized based on the target weight. Specifically, the model training apparatus takes the product of the intermediate result and the target weight as the splicing result.
In this embodiment, the model training apparatus first splices the first result and the second result to obtain the intermediate result, and then optimizes the intermediate result with the target weight to obtain the splicing result, improving its accuracy. Specifically, determining the product of the intermediate result and the target weight as the splicing result improves the accuracy of the splicing result as the result output by the model to be trained.
Optionally, in practical application, the model training apparatus may determine the superposition result of any non-output layer in the manner in which the splicing result of the first layer is determined in steps 401 and 402. Specifically, the model training apparatus splices the result output by each non-output layer with the second result output by the second layer to obtain each non-output layer's spliced result. It then determines the product of each non-output layer's spliced result and that layer's weight to obtain each non-output layer's superposition result, where a non-output layer's weight characterizes the degree to which that layer's output improves the accuracy of the result output by the model to be trained.
As an alternative embodiment, the model training apparatus performs the following steps in performing step 401:
501. and encoding the first result to obtain an encoded first result.
In the embodiment of the present application, the dimension of the encoded first result is the same as the dimension of the second result; that is, the model training apparatus encodes the first result so that its dimension matches that of the second result. For example, if the dimension of the first result is 3 and the dimension of the second result is 6, then the encoded first result obtained by encoding the first result has dimension 6.
502. And splicing the encoded first result and the encoded second result to obtain the intermediate result.
In this embodiment, the model training apparatus encodes the first result so that the dimension of the first result is the same as the dimension of the second result, thereby obtaining the encoded first result. In this way, the first result and the second result after encoding can be spliced to obtain an intermediate result under the condition that the dimension of the first result after encoding is the same as that of the second result.
Optionally, in practical application, the model training apparatus may determine the spliced result of any non-output layer in the manner in which the intermediate result of the first layer is determined in steps 501 and 502. Specifically, the model training apparatus encodes the result output by each non-output layer to obtain each non-output layer's encoding result, where the dimension of each encoding result is the same as the dimension of the second result. Each non-output layer's encoding result is then spliced with the second result to obtain that layer's spliced result.
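Steps 501 and 502 can be sketched with the dimension-3 and dimension-6 example above. The linear projection standing in for the encoding module, and its weights, are assumptions for illustration; the patent does not specify the encoder's internal form.

```python
import numpy as np

first_result = np.ones(3)    # dimension 3
second_result = np.ones(6)   # dimension 6

# step 501: encode the first result so its dimension matches the second result;
# here the encoder is an arbitrary linear map from dimension 3 to dimension 6
encoder = np.full((6, 3), 1.0 / 3.0)
encoded_first = encoder @ first_result   # now dimension 6

# step 502: splice the encoded first result and the second result
intermediate = np.concatenate([encoded_first, second_result])
```

Once the dimensions match, the splice is well defined regardless of the original widths of the two layers, which is the point of the encoding step.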
Based on the model training method provided by the embodiment of the present application, the embodiment of the present application also provides an implementation of model training. Referring to fig. 2, fig. 2 is a schematic structural diagram of a model to be trained according to an embodiment of the present application. As shown in fig. 2, in the process of training the model to be trained with training data, the input of the model to be trained is the training data. The modules in the dashed box in fig. 2 are the structures of the model to be trained; as shown in fig. 2, the model to be trained includes layer 1, layer 2, …, layer N. If the model to be trained is denoted M, then M = {layer_1, layer_2, …, layer_n}, n ≥ 1, where layer_1 represents layer 1, layer_2 represents layer 2, …, layer_n represents layer N, and n represents the number of layers.
As shown in fig. 2, the N layers in the model to be trained are connected in series, that is, the output of layer 1 is the input of layer 2, the output of layer 2 is the input of layer 3, …, and the output of layer (N-1) is the input of layer N. Layers 1 through (N-1) are non-output layers, layer N is the output layer of the model to be trained, and layer 1 is the input layer of the model to be trained; that is, training data input into the model to be trained first enters layer 1, is processed by each layer in turn, and the output of layer N is the output of the model to be trained.
Each layer in the model to be trained shown in fig. 2 has a weight: as shown in fig. 2, w_1 is the weight of layer 1, w_2 is the weight of layer 2, …, and w_n is the weight of layer N, where w_1 characterizes the degree to which the output of layer 1 improves the accuracy of the result output by the model to be trained, w_2 characterizes that degree for the output of layer 2, and w_n characterizes that degree for the output of layer N. If the set of weights of each layer in the model to be trained is denoted W, then W = {w_1, w_2, …, w_n}, n ≥ 1, where w_1 represents the weight of layer 1, w_2 represents the weight of layer 2, …, and w_n represents the weight of layer N.
In the forward propagation of the model to be trained, the model to be trained processes the training data through its N layers to obtain the output result of each layer, where the output result of layer 1 is out_1, that of layer 2 is out_2, that of layer 3 is out_3, …, and that of layer N is out_n. An output matrix out_w is constructed from the output results of each layer in the model to be trained, specifically:

out_w = {out_1, out_2, …, out_n}

The output matrix is encoded by an encoding module (Encoder) so that its dimension is the same as the dimension of out_n, yielding the encoded output matrix out_E. The output result of the model to be trained is then re-determined from the encoded output matrix and the output of layer N, specifically:

out = matmul(concat(out_E, out_n), W)

where out represents the re-determined output result of the model to be trained (hereinafter, the new result), out_n represents the output result of layer N, out_E represents the encoded output matrix, W represents the set of weights of each layer in the model to be trained, concat(A, B) represents splicing A with B, and matmul(A, B) represents A×B. The new result includes the superposition result of each layer in the model to be trained. Specifically, out_E includes the output result of each layer in the model to be trained, so concat(out_E, out_n) indicates that the output result of each layer is spliced with the output result of layer N to obtain the spliced result of each layer, and matmul(concat(out_E, out_n), W) indicates that the product of each layer's spliced result and each layer's weight is determined to obtain the superposition result of each layer. For example, when layer 1 in the model to be trained is the first layer described above, the superposition result of layer 1 in the new result is the splicing result of the first layer.
During the back propagation of the model to be trained, the back-propagation gradient of each layer is determined by backward gradient computation. Specifically, the model training apparatus first determines the loss of the model to be trained from the difference between the output of the model to be trained and the label of the training data; denoting this loss Loss, Loss = F(out, label), where F(·) represents a loss function, out represents the output of the model to be trained, and label represents the label of the training data. The model training apparatus then computes, via the BP algorithm, the partial derivatives of the loss of the model to be trained with respect to the parameters in each layer to obtain the back-propagation gradient of each layer. Denoting the set of back-propagation gradients of each layer in the model to be trained G, then G = {g_1, g_2, …, g_n}, n ≥ 1, where g_1 represents the back-propagation gradient of layer 1, g_2 represents that of layer 2, …, and g_n represents that of layer N. The model training apparatus updates the back-propagation gradient of each layer based on the weight of each layer to obtain the update gradient of each layer; denoting the set of update gradients of each layer Ĝ, then Ĝ = {w_1 × g_1, w_2 × g_2, …, w_n × g_n}. Finally, the model training apparatus updates the parameters of each layer and the weight of each layer based on the update gradient of each layer until the loss of the model to be trained converges, completing the training of the model to be trained and obtaining the target model.
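The backward step above, in which the update gradient of each layer is the product of that layer's weight and its back-propagation gradient, can be sketched as follows. The parameter values, gradients, learning rate, and the rule used to update the weights themselves are illustrative assumptions.

```python
import numpy as np

n = 3                                       # number of layers (assumption)
params = [np.array([1.0]), np.array([2.0]), np.array([3.0])]   # per-layer parameters
W = np.array([0.5, 1.0, 1.5])               # per-layer weights w_1..w_n
G = [np.array([0.2]), np.array([0.2]), np.array([0.2])]        # back-prop gradients g_1..g_n
lr = 0.1                                    # learning rate

# update gradients: the i-th element of G_hat is w_i * g_i
G_hat = [W[i] * G[i] for i in range(n)]

# update each layer's parameters based on its update gradient
for i in range(n):
    params[i] = params[i] - lr * G_hat[i]

# the weights themselves are also updated; the exact rule is not specified
# in the text, so a plain gradient step on the same values is used here
W = W - lr * np.array([float(g) for g in G_hat])
```

Equal raw gradients produce unequal parameter steps once the weights are applied, so layers judged more important move faster toward convergence.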
According to the model training method provided by the embodiment of the present application, the output result of each layer of the model to be trained is spliced with the output result of the model to be trained, adding each layer's output to the model's output, so that the degree to which each layer's output improves the accuracy of the result output by the model to be trained can be determined from that output result. Because the output result of the model to be trained is determined based on the weight of each layer, the accuracy of the model's output can be improved; and after the back-propagation gradient of each layer is determined from the output result of the model to be trained and the label of the training data, optimizing each layer's back-propagation gradient based on its weight to obtain its update gradient improves the accuracy of each layer's gradient. Updating the parameters and the weight of each layer based on the update gradient of each layer therefore improves the optimization efficiency of each layer, allows the model to be trained to converge quickly, and improves the accuracy of each layer's weight; the training efficiency and the training effect of the model to be trained are thereby improved.
It will be appreciated by those skilled in the art that, in the methods of the specific embodiments above, the written order of the steps does not imply a strict order of execution; the actual order of execution should be determined by the function of each step and any inherent logic.
The method of the embodiments of the present application is described in detail above; the apparatus of the embodiments of the present application is described below.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application, where the model training apparatus 1 includes: acquisition unit 11, determination unit 12, processing unit 13, updating unit 14, in particular:
an obtaining unit 11, configured to obtain training data, where the training data is used to update parameters of a model to be trained;
A determining unit 12, configured to process the training data through the model to be trained, and determine a target gradient of a first layer of the model to be trained;
The processing unit 13 is configured to obtain an update gradient of the first layer based on a preset target weight and the target gradient, where the target weight represents the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained;
And an updating unit 14, configured to update the parameters of the first layer based on the update gradient in a process of updating the parameters of the model to be trained, so as to obtain a target model.
In combination with any embodiment of the present application, the updating unit 14 is configured to:
And in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer and the target weight based on the updating gradient to obtain a target model.
In combination with any one of the embodiments of the present application, the model to be trained includes a second layer, and an output of the second layer is a result output by the model to be trained;
the determining unit 12 is configured to:
Processing the training data through the model to be trained, wherein the first layer outputs a first result, and the second layer outputs a second result;
splicing the first result and the second result to obtain a spliced result;
and determining the target gradient based on the difference between the splicing result and the label of the training data.
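As an illustrative sketch of this step (a mean-squared-error loss and made-up numbers serve purely as stand-ins; the embodiments do not prescribe a specific loss function), a loss and a gradient can be derived from the difference between the splice result and the label:

```python
import numpy as np

splice_result = np.array([0.8, 0.1, 0.6])  # spliced output for one sample
label = np.array([1.0, 0.0, 0.5])          # label of the training data

diff = splice_result - label               # difference used by the loss
loss = np.mean(diff ** 2)                  # loss of the first layer (MSE here)

# For MSE, the partial derivative of the loss w.r.t. the splice result is
# 2 * diff / n; in a real network this would be back-propagated further to
# the parameters of the first layer to obtain the target gradient.
target_gradient = 2.0 * diff / diff.size

print(round(loss, 4))  # 0.02
```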
In combination with any of the embodiments of the present application, the determining unit 12 is configured to:
splicing the first result and the second result to obtain an intermediate result;
And determining the product of the intermediate result and the target weight as the splicing result.
In combination with any of the embodiments of the present application, the determining unit 12 is configured to:
encoding the first result to obtain an encoded first result, wherein the dimension of the encoded first result is the same as the dimension of the second result;
and splicing the encoded first result and the encoded second result to obtain the intermediate result.
In combination with any of the embodiments of the present application, the determining unit 12 is configured to:
determining a loss of the first layer based on a difference of the splice result and a label of the training data;
The target gradient is determined based on the loss of the first layer.
In combination with any of the embodiments of the present application, the determining unit 12 is configured to:
And calculating the partial derivative of the loss of the first layer with respect to the parameters in the first layer through a back propagation algorithm to obtain the target gradient.
In combination with any embodiment of the present application, the processing unit 13 is configured to:
and determining the product of the target weight and the target gradient as the updated gradient.
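A minimal sketch of this scaling follows (the parameter values, learning rate, and plain gradient descent are assumptions for illustration; any optimizer could stand in):

```python
import numpy as np

learning_rate = 0.1
target_weight = 0.5                           # preset weight of the first layer

params = np.array([1.0, -2.0, 0.5])           # parameters of the first layer
target_gradient = np.array([0.2, -0.4, 0.1])  # gradient from back propagation

# The update gradient is the product of the target weight and the target
# gradient; the layer's parameters are then updated with it.
update_gradient = target_weight * target_gradient
params = params - learning_rate * update_gradient  # gradient-descent step
```

A layer whose output contributes little to the model's accuracy thus receives a small weight and takes correspondingly small parameter steps.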
In combination with any one of the embodiments of the present application, the training data is training text including a mask, and the result output by the model to be trained includes a predicted result of the mask.
In the embodiments of the present application, after acquiring the training data, the model training apparatus processes the training data with the model to be trained to determine the target gradient of the first layer of the model. Because the preset target weight of the first layer characterizes the degree to which the first layer improves the accuracy of the result output by the model, the apparatus optimizes the target gradient based on the target weight to obtain the update gradient of the first layer. In the process of updating the parameters of the model to be trained, the parameters of the first layer are then updated based on the update gradient to obtain the target model. This allows the model to complete training faster and improves the accuracy of the results output by the target model, thereby improving both the training efficiency and the training effect of the model to be trained.
According to the embodiments of the present application, optimizing the gradient of each layer of the model to be trained makes the direction in which the parameters of each layer are updated more accurate, which shortens the time consumed to update the parameters of each layer and therefore shortens the time consumed for the model to converge. Specifically, when the model to be trained is trained with the method of the embodiments of the present application, the reduction in convergence time is the sum of the time savings across the parameters of all layers. Because a large-scale model has many layers and many parameters, the convergence time can be shortened significantly and the training efficiency improved when the model to be trained is a large-scale model.
In some embodiments, the functions or modules included in the apparatus provided by the embodiments of the present application may be used to perform the methods described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Fig. 4 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application. The electronic device 2 includes a processor 21 and a memory 22. Optionally, the electronic device 2 further includes an input device 23 and an output device 24. The processor 21, the memory 22, the input device 23, and the output device 24 are coupled through connectors, which include various interfaces, transmission lines, buses, and the like; the embodiments of the present application are not limited in this respect. It should be appreciated that, in the various embodiments of the present application, being coupled means being interconnected in a particular manner, either directly or indirectly through other devices, for example through various interfaces, transmission lines, buses, and the like.
The processor 21 may include one or more processors, for example one or more central processing units (central processing unit, CPU). Where the processor is a CPU, it may be a single-core or a multi-core CPU. Alternatively, the processor 21 may be a processor group composed of a plurality of CPUs coupled to each other through one or more buses. The processor may also be another type of processor; the embodiments of the present application are not limited in this respect.
The memory 22 may be used to store computer program instructions and various types of computer program code for executing aspects of the present application. Optionally, the memory includes, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (read-only memory, ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM), which is used for storing related instructions and data.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It will be appreciated that in the embodiment of the present application, the memory 22 may be used to store not only related instructions, but also related data, for example, the memory 22 may be used to store a model to be trained and training data obtained through the input device 23, or the memory 22 may be used to store a target model obtained through the processor 21, etc., and the embodiment of the present application is not limited to the data specifically stored in the memory.
It will be appreciated that fig. 4 shows only a simplified design of an electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all electronic devices that can implement the embodiments of the present application are within the scope of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, for the specific working procedures of the systems, apparatuses, and units described above, reference may be made to the corresponding procedures in the foregoing method embodiments, which are not repeated herein. It will further be clear to those skilled in the art that each embodiment of the present application is described with its own emphasis; for convenience and brevity, the same or similar parts may not be described in detail in different embodiments, and for parts not described, or not described in detail, in one embodiment, reference may be made to the descriptions of the other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (digital versatile disc, DVD)), a semiconductor medium (e.g., a solid state disk (solid state disk, SSD)), or the like.
Those of ordinary skill in the art will appreciate that all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium, and when executed, the program may include the flows of the above method embodiments. The aforementioned storage medium includes: a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disk, or the like.
Claims (12)
1. A method of model training, the method comprising:
Obtaining training data, wherein the training data is used for updating parameters of a model to be trained, and the training data comprises one of the following: text data with a tag, image data with a tag, audio data with a tag;
processing the training data through the model to be trained, and determining a target gradient of a first layer of the model to be trained;
Obtaining an update gradient of the first layer based on a preset target weight and the target gradient, wherein the target weight represents the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained;
and in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer based on the updating gradient to obtain a target model.
2. The method according to claim 1, wherein in updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer based on the update gradient to obtain a target model comprises:
And in the process of updating the parameters of the model to be trained based on the training data, updating the parameters of the first layer and the target weight based on the updating gradient to obtain a target model.
3. A method according to claim 1 or 2, wherein the model to be trained comprises a second layer, the output of the second layer being the result of the output of the model to be trained;
the training data is processed through the model to be trained, and the target gradient of the first layer is determined, which comprises the following steps:
Processing the training data through the model to be trained, wherein the first layer outputs a first result, and the second layer outputs a second result;
splicing the first result and the second result to obtain a spliced result;
and determining the target gradient based on the difference between the splicing result and the label of the training data.
4. A method according to claim 3, wherein the splicing the first result and the second result to obtain a spliced result comprises:
splicing the first result and the second result to obtain an intermediate result;
And determining the product of the intermediate result and the target weight as the splicing result.
5. The method of claim 4, wherein the stitching the first result and the second result to obtain an intermediate result comprises:
encoding the first result to obtain an encoded first result, wherein the dimension of the encoded first result is the same as the dimension of the second result;
and splicing the encoded first result and the encoded second result to obtain the intermediate result.
6. The method of claim 3, wherein the determining the target gradient based on a difference in the splice result and a label of the training data comprises:
determining a loss of the first layer based on a difference of the splice result and a label of the training data;
The target gradient is determined based on the loss of the first layer.
7. The method of claim 6, wherein the determining the target gradient based on the loss of the first layer comprises:
and calculating a partial derivative of the loss of the first layer with respect to the parameters in the first layer to obtain the target gradient.
8. The method according to claim 1 or 2, wherein the obtaining the updated gradient of the first layer based on the preset target weight and target gradient comprises:
and determining the product of the target weight and the target gradient as the updated gradient.
9. The method according to claim 1 or 2, wherein the training data is training text comprising a mask, and the result output by the model to be trained comprises a predicted result of the mask.
10. A model training apparatus, the apparatus comprising:
an acquisition unit, configured to acquire training data, wherein the training data is used for updating parameters of a model to be trained, and the training data comprises one of the following: text data with a tag, image data with a tag, audio data with a tag;
The determining unit is used for processing the training data through the model to be trained and determining the target gradient of the first layer of the model to be trained;
a processing unit, configured to obtain an update gradient of the first layer based on a preset target weight and the target gradient, wherein the target weight represents the degree to which the output of the first layer improves the accuracy of the result output by the model to be trained;
and the updating unit is used for updating the parameters of the first layer based on the updating gradient in the process of updating the parameters of the model to be trained based on the training data so as to obtain a target model.
11. An electronic device, comprising: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 9.
12. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311266738.8A CN117725979B (en) | 2023-09-27 | 2023-09-27 | Model training method and device, electronic equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117725979A CN117725979A (en) | 2024-03-19 |
CN117725979B true CN117725979B (en) | 2024-09-20 |
Family ID: 90204075
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |