
CN116976428A - Model training method, device, equipment and storage medium - Google Patents

Model training method, device, equipment and storage medium

Info

Publication number: CN116976428A
Application number: CN202211289380.6A
Authority: CN (China)
Prior art keywords: model, teacher, student, teacher model, weight
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李丽芳, 于斌斌
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd

Classifications

    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N5/04 Inference or reasoning models
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiments of the disclosure disclose a model training method, apparatus, device and storage medium. The method includes: acquiring a training set, a test set and an initial model; training the initial model with the training set to obtain a teacher model; and compressing the teacher model at least twice based on the test set to obtain a student model. The difference between the prediction accuracy of the student model and that of the teacher model is smaller than an accuracy threshold; the student model is deployed to a terminal device whose hardware resources meet a preset condition and is used to predict data to be processed; and the compression processing includes at least one of: pruning, quantization, coding, knowledge distillation. The embodiments of the disclosure can reduce the storage requirement of the student model several-fold with little effect on its prediction accuracy.

Description

Model training method, device, equipment and storage medium
Technical Field
The present disclosure relates to, but not limited to, the field of artificial intelligence, and in particular, to a model training method, apparatus, device, and storage medium.
Background
With the rapid development of artificial intelligence, trained models can handle increasingly diverse and complex tasks. To improve prediction accuracy, models trained in the related art have become more and more complex in structure and ever larger in data volume, which makes them difficult to deploy on systems or terminals with limited hardware resources.
Disclosure of Invention
In view of this, embodiments of the present disclosure at least provide a model training method, apparatus, device, and storage medium.
The technical scheme of the embodiment of the disclosure is realized as follows:
in one aspect, an embodiment of the present disclosure provides a model training method, including: acquiring a training set, a test set and an initial model; training the initial model with the training set to obtain a teacher model; and compressing the teacher model at least twice based on the test set to obtain a student model; wherein the difference between the prediction accuracy of the student model and the prediction accuracy of the teacher model is smaller than an accuracy threshold; the student model is configured to be deployed to a terminal device whose hardware resources meet a preset condition and to predict data to be processed; and the compression processing includes at least one of: pruning, quantization, coding, knowledge distillation.
In another aspect, an embodiment of the present disclosure provides a model training apparatus, including: an acquisition module, configured to acquire a training set, a test set and an initial model; a training module, configured to train the initial model with the training set to obtain a teacher model; and a compression module, configured to compress the teacher model at least twice based on the test set to obtain a student model; wherein the difference between the prediction accuracy of the student model and the prediction accuracy of the teacher model is smaller than an accuracy threshold; the student model is configured to be deployed to a terminal device whose hardware resources meet a preset condition and to predict data to be processed; and the compression processing includes at least one of: pruning, quantization, coding, knowledge distillation.
In yet another aspect, embodiments of the present disclosure provide a computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, the processor implementing some or all of the steps of the above method when the program is executed.
In yet another aspect, the disclosed embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs some or all of the steps of the above method.
In yet another aspect, the disclosed embodiments provide a computer program comprising computer readable code which, when run in a computer device, causes a processor in the computer device to perform some or all of the steps for carrying out the above method.
In yet another aspect, the disclosed embodiments provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the above method.
In the related art, a teacher model and a student model are obtained by separately training a complex model and a small-scale model with different structures. In the embodiments of the disclosure, first, a training set, a test set and an initial model are acquired, and the initial model is trained with the training set to obtain a teacher model. This reduces the amount of training and improves the efficiency of subsequently obtaining the student model. Second, the teacher model is compressed at least twice based on the test set to obtain a student model. The difference between the prediction accuracy of the student model and that of the teacher model is smaller than an accuracy threshold; the student model is deployed to a terminal device whose hardware resources meet a preset condition and is used to predict data to be processed; and the compression processing includes at least one of: pruning, quantization, coding, knowledge distillation. In this way, the teacher model is compressed in different compression modes, and the gap between the prediction accuracy of the student model and that of the teacher model can be kept small while the scale of the student model is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a schematic implementation flow chart of a first model training method according to an embodiment of the disclosure;
FIG. 2 is a schematic implementation flow chart of a second model training method according to an embodiment of the disclosure;
FIG. 3 is a schematic implementation flow chart of a third model training method according to an embodiment of the disclosure;
FIG. 4 is a schematic implementation flow chart of a fourth model training method according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram of a model training system according to an embodiment of the present disclosure;
FIG. 6 is a schematic implementation flow chart of a fifth model training method according to an embodiment of the disclosure;
fig. 7 is a schematic diagram of a composition structure of a model training device according to an embodiment of the disclosure;
fig. 8 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present disclosure more apparent, the technical solutions of the present disclosure are further elaborated below in conjunction with the drawings and the embodiments, and the described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by those skilled in the art without making inventive efforts are within the scope of protection of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. The term "first/second/third" is merely to distinguish similar objects and does not represent a particular ordering of objects, it being understood that the "first/second/third" may be interchanged with a particular order or precedence where allowed, to enable embodiments of the disclosure described herein to be implemented in other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing the present disclosure only and is not intended to be limiting of the present disclosure.
Embodiments of the present disclosure provide a model training method that may be performed by a processor of a computer device. The computer device may be a device with model training capability, such as a server, a notebook computer, a tablet computer, a desktop computer, a smart television, a set-top box, a mobile device (e.g., a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, and a portable game device). Fig. 1 is a schematic implementation flow chart of a first model training method according to an embodiment of the disclosure, as shown in fig. 1, the method includes steps S101 to S103 as follows:
step S101, a training set, a testing set and an initial model are obtained.
Here, the training set may refer to a set of samples, or data samples, used to fit the teacher model: the initial model is trained with the training set to obtain the teacher model. The test set may refer to a set of samples used for knowledge distillation of the teacher model. The training set and the test set may include annotated samples such as radar data, text, images or audio, and a first sample in the training set and a second sample in the test set belong to the same scene, for example the same face scene, audio scene or road scene. First and second samples from different scenes train the teacher model and the student model for different tasks. For example: if the first sample and the second sample are both face images, the trained teacher model and student model can be used for face recognition; if the first sample and the second sample are both street-view images, the trained teacher model and student model can be used for path planning of vehicles, and so on. The number of samples in the training set may be equal to or greater than the number of samples in the test set. The initial model may refer to an untrained teacher model, and the number of parameters in the initial model may be greater than a preset number; for example, when the initial model is a neural network model, the number of network layers of the initial model is greater than a preset number of layers, and so on.
Step S101 may include: receiving an uploaded sample set, randomly distributing the sample set according to a preset allocation ratio to obtain a training set and a test set under the current allocation ratio, and receiving an uploaded initial model. In some embodiments, web crawling may also be performed on the Internet to obtain a sample set, and an initial model meeting a preset condition may be crawled from the Internet, where the preset condition may be that the data size of the initial model is greater than a preset data size, and so on.
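As an illustrative sketch of this allocation step, the following Python snippet splits an annotated sample set by a preset ratio; the function name and the 0.8 ratio are assumptions, not values taken from the disclosure.

```python
import random

def split_samples(samples, train_ratio=0.8, seed=0):
    """Randomly distribute an annotated sample set into a training set and a
    test set according to a preset allocation ratio (0.8 here is illustrative)."""
    rng = random.Random(seed)
    shuffled = list(samples)          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]   # (training set, test set)

train_set, test_set = split_samples([f"sample_{i}" for i in range(10)])
```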
Step S102, training the initial model by using the training set to obtain a teacher model.
Here, the teacher model may refer to the trained initial model, and the number of parameters in the teacher model may be greater than a preset number; for example, when the teacher model is a neural network model, the number of network layers of the teacher model is greater than a preset number of layers, and so on. In the implementation of step S102, each training sample in the training set can be predicted with the initial model to obtain a prediction result; the current loss of the initial model is determined based on the feature distance between the prediction result and the labeling information of the training sample; the current gradient of the initial model is determined from the current loss by gradient descent, and the weights of the initial model are adjusted according to the current gradient and a preset step size; and when the prediction result of the initial model meets a preset precision condition, the trained initial model is determined as the teacher model. The teacher model has attributes such as a memory footprint larger than a preset data amount when deployed on a computer device, or an inference time longer than a preset duration, so knowledge distillation can be performed on it.
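A minimal training-loop sketch of step S102 is shown below, assuming a PyTorch classification setting; cross-entropy stands in for the loss based on the feature distance between the prediction result and the labeling information, and all hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def train_teacher(initial_model, train_loader, epochs=10, lr=1e-3, target_acc=0.95):
    """Train the initial model on the training set by gradient descent and treat
    the result as the teacher model once a preset precision condition is met."""
    optimizer = torch.optim.SGD(initial_model.parameters(), lr=lr)
    for _ in range(epochs):
        correct, total = 0, 0
        for x, y in train_loader:
            logits = initial_model(x)              # prediction result
            loss = F.cross_entropy(logits, y)      # current loss of the initial model
            optimizer.zero_grad()
            loss.backward()                        # current gradient
            optimizer.step()                       # adjust weights by gradient and step
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.numel()
        if total and correct / total >= target_acc:   # preset precision condition
            break
    return initial_model                           # the trained model is the teacher
```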
And step S103, based on the test set, performing compression processing on the teacher model at least twice to obtain a student model.
Here, the student model may refer to a model obtained by compressing the teacher model, and the number of parameters in the student model may be less than or equal to a preset number; for example, when the student model is a neural network model, the number of network layers of the student model is less than or equal to a preset number of layers, and so on. The various teacher models obtained by training are widely applied in fields such as computer vision, natural language processing, and recommendation and search. The large number of weights to be trained in a teacher model often occupies a large amount of memory space and storage bandwidth, which makes it difficult to deploy the teacher model on some embedded platforms and mobile terminals. In addition, the amount of computation required by the teacher model consumes a great deal of energy, and the computation cost is relatively expensive. Model compression can perform operations such as parameter compression and dimension reduction on the teacher model's original network structure, or redesign a simple network structure, to obtain a student model, so as to improve the training and inference speed of the model, reduce its scale, and so on. According to whether model compression changes the network structure, it can be divided into shallow compression and deep compression: shallow compression may include model pruning and knowledge distillation, and deep compression may include model quantization, designing a lightweight network structure, and so on.
The main goal of pruning is to reduce the memory consumption of the model without sacrificing model accuracy as far as possible (accuracy may even improve in some scenarios), which helps reduce model complexity, lower memory usage, and mitigate overfitting to a certain extent. The purpose of quantization is likewise to reduce the space occupied by the weights of the teacher model. Unlike pruning, which reduces redundant connections in the teacher model, quantization tackles the problem from the stored form of the parameters in the teacher model (for example, using fewer bits to record weights, or sharing/clustering weights). Coding may mean encoding the weights of the teacher model with Huffman coding to obtain an encoded teacher model. Knowledge distillation differs from pruning and quantization in model compression: a lightweight, un-distilled student model is constructed and then trained with the supervision information of a trained, better-performing teacher model, so as to achieve better performance and accuracy. Knowledge distillation may include offline distillation, semi-supervised distillation, self-supervised distillation, and so on.
The difference between the prediction accuracy of the student model and the prediction accuracy of the teacher model is smaller than an accuracy threshold. For example, if the prediction accuracy of the teacher model is 0.95 and that of the student model is 0.94, the difference is 0.01, which is smaller than an accuracy threshold of 0.05. Prediction accuracy may refer to the probability that a sample's prediction result matches its labeling information, which characterizes the true result of the sample. The student model is deployed to a terminal device whose hardware resources meet a preset condition and is used to predict data to be processed. The preset condition may be that the hardware resources currently available to the terminal device meet the hardware resources required by the student model for inference, and the data to be processed may be data belonging to the same scene as the samples in the training set or the test set. The hardware resources may include computing resources and/or storage resources, and the current computing or storage resources may be determined by reading the current performance metrics of the hardware, such as the throughput, video memory size, memory bit width, memory bandwidth and storage capacity of the graphics processing unit (GPU). For example: if the video memory required by the student model is 256 megabytes and the video memory of the graphics processor in the terminal device is 512 megabytes, the student model can be deployed to the terminal device; if the data volume of the student model is 500 megabytes and the available storage space of the memory in the terminal device is 800 megabytes, it is determined that the student model can be deployed to the terminal device, and so on.
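A hedged sketch of the deployment check described above, assuming the terminal device exposes its free GPU memory through torch.cuda.mem_get_info (available in recent PyTorch versions); the function name and the example size are illustrative.

```python
import torch

def can_deploy_student(required_bytes):
    """Check whether the terminal device's currently free GPU memory meets the
    hardware resources the student model needs for inference."""
    if not torch.cuda.is_available():
        return False
    free_bytes, _total_bytes = torch.cuda.mem_get_info()   # (free, total) in bytes
    return free_bytes >= required_bytes

# e.g. a 256 MB student model on a device with enough free video memory
print(can_deploy_student(256 * 1024 * 1024))
```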
The compression processing includes at least one of: pruning, quantization, coding, knowledge distillation. Step S103 may include: pruning the teacher model to obtain a pruned teacher model; taking the pruned teacher model as an un-distilled student model; and, based on the test set, performing knowledge distillation on the un-distilled student model with the original teacher model to obtain the student model. It may further include: pruning the teacher model to obtain a pruned teacher model; quantizing the pruned teacher model to obtain a quantized teacher model; taking the quantized teacher model as an un-distilled student model; and, based on the test set, performing knowledge distillation on the un-distilled student model with the original teacher model to obtain the student model, and so on. For example: the training set and the test set include industrial product images annotated with defects, and the initial model is deployed to a first terminal device; the initial model is trained in the first terminal device based on the training set to obtain a teacher model; the teacher model is compressed at least twice in the first terminal device based on the test set to obtain a student model; and the student model is deployed to a second terminal device, where defect detection is performed on collected industrial product images with the student model to obtain a detection result, which may include the presence of a defect, the absence of a defect, and so on. The hardware resources of the first terminal device are better than those of the second terminal device, and the difference between the prediction accuracy of the teacher model and that of the student model is smaller than a preset accuracy.
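A minimal sketch of one possible ordering of these compression passes; prune, quantize and distil are caller-supplied placeholders standing for the routines sketched in the later sections, not functions defined by the disclosure.

```python
def compress_teacher(teacher_model, test_set, prune, quantize, distil):
    """Prune the teacher, quantize the pruned weights, then knowledge-distil the
    result against the original (uncompressed) teacher on the test set."""
    pruned_and_quantized = quantize(prune(teacher_model))      # pruning + quantization
    return distil(teacher_model, pruned_and_quantized, test_set)  # knowledge distillation
```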
In the related art, a teacher model and a student model are obtained by separately training a complex model and a small-scale model with different structures. In the embodiments of the disclosure, first, a training set, a test set and an initial model are acquired, and the initial model is trained with the training set to obtain a teacher model. This reduces the amount of training and improves the efficiency of subsequently obtaining the student model. Second, the teacher model is compressed at least twice based on the test set to obtain a student model. The difference between the prediction accuracy of the student model and that of the teacher model is smaller than an accuracy threshold; the student model is deployed to a terminal device whose hardware resources meet a preset condition and is used to predict data to be processed; and the compression processing includes at least one of: pruning, quantization, coding, knowledge distillation. In this way, the teacher model is compressed with different types of compression, and the gap between the prediction accuracy of the student model and that of the teacher model can be kept small while the scale of the student model is reduced. Meanwhile, since the student model is obtained directly by compressing the teacher model multiple times, the correlation between the teacher model and the student model is improved, and the training efficiency of the student model is improved, and so on.
The embodiment of the disclosure provides a second model training method, wherein the student models comprise a first student model and a second student model. As shown in fig. 2, the method includes the following steps S201 to S204:
steps S201 to S202 correspond to steps S101 to S102, respectively, and reference may be made to the specific embodiments of steps S101 to S102.
And step S203, compressing the data volume of the teacher model by adopting a preset compression mode to obtain the first student model.
Here, the compression method includes at least one of: pruning, quantization, encoding, etc. for compressing the data volume of the teacher model. The first student model may refer to a student model in which the data amount of the teacher model is compressed, the data amount of the first student model is smaller than the data amount of the teacher model, and simultaneously, the prediction accuracy of the first student model is smaller than the prediction accuracy of the teacher model. For example: pruning is carried out on the teacher model to obtain a pruned teacher model; quantizing the pruned teacher model to obtain a quantized teacher model; and encoding the quantized teacher model to obtain a first student model.
And step S204, based on the test set, performing knowledge distillation on the first student model by using the teacher model to obtain the second student model.
Here, the second student model may refer to the student model obtained by performing knowledge distillation on the first student model; the data size of the second student model is smaller than that of the teacher model, the difference between the prediction accuracy of the second student model and that of the teacher model is smaller than a preset accuracy threshold, and the prediction accuracy of the second student model is greater than that of the first student model. Knowledge distillation may include, but is not limited to, individual (conventional) knowledge distillation (Individual Knowledge Distillation, IKD), relational knowledge distillation (Relational Knowledge Distillation, RKD), self-supervised knowledge distillation (Self-Supervised Knowledge Distillation, SSKD), and the like. Relational knowledge distillation determines the distillation loss based on the feature distances between samples, and self-supervised knowledge distillation uses self-supervised learning as an auxiliary task.
In some embodiments, the step S204 may include the following steps S2041 to S2043:
step S2041, under the condition that the distillation temperature is set to be a first temperature, respectively predicting samples in a test set by using a teacher model and a first student model to obtain a first output result and a second output result; a first cross entropy between the first output result and the second output result is determined, and the first cross entropy is determined as a first Loss (disfigurement Loss).
Step S2042, under the condition that the distillation temperature is set to be 1, predicting samples in the test set by using a first student model to obtain a third output result; determining a second cross entropy between the feature for characterizing the third output result and the feature for characterizing the labeling information of the sample, determining the second cross entropy as a second Loss (Student Loss); wherein the first temperature is not equal to 1.
Step S2043, determining a third cross entropy between the first loss and the second loss, and determining the third cross entropy as the current total loss of the first student model; and adjusting the weight of the first model based on the current total loss to obtain a second student model.
Wherein the distillation temperature is used in the softmax (logistic regression) function to soften the output distribution. The larger the distillation temperature, the flatter the distribution of the output results; the smaller the distillation temperature, the more easily the probabilities of incorrect classes are amplified, which introduces unnecessary noise, and so on.
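A hedged PyTorch sketch of steps S2041 to S2043 in the usual Hinton-style form: a distillation loss between the softened teacher and student outputs at temperature T and a student loss at temperature 1, combined here as a weighted sum; T, alpha and the weighting scheme are assumptions rather than values fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def kd_total_loss(teacher_logits, student_logits, labels, T=4.0, alpha=0.5):
    """Distillation Loss at temperature T plus Student Loss at temperature 1."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)          # first output result
    log_soft_student = F.log_softmax(student_logits / T, dim=1)  # second output result
    # first loss: cross entropy between the two softened distributions
    distillation_loss = -(soft_teacher * log_soft_student).sum(dim=1).mean() * T * T
    # second loss: student prediction at T = 1 against the labeling information
    student_loss = F.cross_entropy(student_logits, labels)
    return alpha * distillation_loss + (1.0 - alpha) * student_loss
```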
In the embodiments of the disclosure, the first student model is obtained by compressing the data volume of the teacher model with a preset compression mode, and then knowledge distillation is performed on the first student model with the teacher model based on the test set to obtain the second student model. On the one hand, the storage requirement of the student model can be reduced several-fold with little effect on its accuracy; on the other hand, since the student model is obtained by compressing the teacher model, the training efficiency of the student model is improved, and so on.
The embodiment of the present disclosure provides a third model training method, as shown in fig. 3, including the following steps S301 to S306:
steps S301 to S302 correspond to steps S201 to S202, respectively, and reference may be made to the specific embodiments of steps S201 to S202; step S306 corresponds to step S204 described above, and reference may be made to the specific embodiment of step S204 described above.
Step S303, pruning is carried out on the model structure or the model parameters of the teacher model, so as to obtain the pruned teacher model.
Here, the principle of the model construction may be: deeper models mean better non-linear expressivity, wider networks can produce more diverse features, and thus, models are increasingly wider and deeper. However, the increase in the depth and width of the model makes it difficult for the model calculation delay to meet the application standards. Meanwhile, when the deployment of the model is gradually expanded from a heavy server to terminal equipment such as wearable equipment, mobile phones, robots and unmanned aerial vehicles, the terminal equipment has strict limitation on computing capacity and memory space, so that the model is difficult to transplant or deploy. Model pruning may refer to reducing the number of operations involved in computation in a model, thereby reducing computation and memory stress. For example: pruning processing can be performed on a model structure or model parameters of the teacher model, the model structure can comprise neurons, channels, network layers and the like, and the model parameters can comprise parameters such as weights and the like.
Model pruning may include unstructured pruning and structured pruning. Unstructured pruning may include weight pruning, vector pruning, kernel pruning, and so on; it can make the model structure irregular, which requires special hardware support for sparse operations, but it is finer-grained and the pruned model retains higher prediction accuracy. Structured pruning may include convolution-kernel pruning, channel pruning, layer pruning, and so on; it only changes the number of convolution kernels and feature channels in the model and requires no special algorithm design. For example: the data volume of a teacher model is 100 megabytes; weight pruning is performed on the teacher model, and the data volume of the pruned teacher model is 60 megabytes.
In the related art, pruning achieves rapid model compression by removing weakly connected or inactive neurons from the weighted connections. The more neurons are removed during pruning, the faster the model's accuracy drops, so fine-tuning is often required after pruning to restore the model's performance. However, current fine-tuning methods are time-consuming and labor-intensive and raise issues of data access and privacy, so they are not practical in real scenarios. In the embodiments of the disclosure, by combining pruning with knowledge distillation, the requirement on dataset size during post-pruning fine-tuning in the related art is reduced, the training needed to restore the student model's prediction accuracy is accelerated, and the accuracy degradation caused by fine-tuning the compressed student model is reduced, among other benefits.
And step S304, carrying out quantization processing on the weight of the pruned teacher model to obtain a quantized teacher model.
Here, the model quantization may refer to further compressing the pruned teacher model by reducing the number of bits required to represent each weight. Model quantization may include linear quantization, nonlinear quantization, layer-by-layer quantization, group-by-group quantization, channel-by-channel quantization, online quantization, offline quantization, bit quantization, weight activated quantization, and the like, without limitation.
Since faster speed and higher accuracy are generally desired when training the teacher model, its weights are stored as floating-point numbers, and a high-end server equipped with a powerful graphics processor and large storage space makes libraries that accelerate floating-point arithmetic feasible; however, the processing power and storage space of embedded terminal devices are insufficient for computing with high-precision floating-point weights. Quantizing the weights of the teacher model reduces its data volume and increases its running speed. For example: if the weights are 32-bit floating-point numbers and the teacher model's weights are converted from 32-bit floating-point numbers to 8-bit signed integers, the memory bandwidth required to run the 8-bit model is only 25% of that required to run the 32-bit model, and the conversion also makes better use of the cache.
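A minimal symmetric linear quantization sketch illustrating the 32-bit-float to 8-bit-signed conversion described above; the single-scale scheme and the function name are assumptions.

```python
import numpy as np

def quantize_to_int8(weights):
    """Map 32-bit floating-point weights onto 8-bit signed integers with a
    single scale factor; dequantize later as q.astype(np.float32) * scale."""
    scale = float(np.abs(weights).max()) / 127.0 or 1.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

weights = np.random.randn(4, 4).astype(np.float32)   # 32-bit weights
q, scale = quantize_to_int8(weights)                 # 8-bit signed values + one float
```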
And step S305, carrying out coding processing on the quantized weight of the teacher model to obtain the first student model.
Here, coding may refer to performing an encoding transformation on the quantized weights of the teacher model to obtain an encoded teacher model. The encoding may include adaptive Huffman coding, transform coding, Shannon coding, Huffman coding, and so on. For example: the quantized weights of the teacher model are encoded with Huffman coding to obtain the encoded teacher model. Huffman coding is a variable-length coding scheme: it encodes the weights of the teacher model with a variable-length code table obtained by evaluating the occurrence probability of each weight value, so that weights with high occurrence probability use shorter codes and weights with low occurrence probability use longer codes. This reduces the expected average code length of all weights after encoding and achieves lossless data compression.
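An illustrative Huffman code-table construction over quantized weight values, showing how frequent values receive shorter codes; this is a generic sketch, not the exact encoder specified by the disclosure.

```python
import heapq
from collections import Counter

def huffman_code_table(symbols):
    """Build a variable-length code table from the occurrence frequency of each
    quantized weight value: frequent values get shorter codes."""
    freq = Counter(symbols)
    if len(freq) == 1:                                   # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # heap items: (total frequency, tie-breaker, {symbol: code suffix so far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # prepend 0 to the left subtree
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]                                    # symbol -> bit string

print(huffman_code_table([0, 0, 0, 0, 1, 1, 2]))         # 0 gets the shortest code
```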
In the embodiment of the disclosure, pruning is performed on a model structure or model parameters of a teacher model to obtain the pruned teacher model; carrying out quantization processing on the weight of the pruned teacher model to obtain a quantized teacher model; and carrying out coding processing on the quantized weights of the teacher model to obtain a first student model. Therefore, the data volume of the teacher model can be accurately compressed, and the compression efficiency of the teacher model is improved.
In some embodiments, the teacher model is a neural network model, and the step S303 may include the following steps S3031 to S3032:
step S3031, determining a weight matched by each neuron in the teacher model.
Here, the teacher model may be a neural network model including a plurality of neurons. Wherein each neuron corresponds to a weight, e.g., the weight of a first neuron in the third layer network to the fourth layer network is 0.1, and the weight of a second neuron in the third layer network to the fourth layer network is 0.3. In the implementation process of step S3031, the first weights of the current neuron for each neuron in the next layer may be determined, the weight average of all the first weights is determined, and the weight average is determined as the weight matched by the current neuron.
Step S3032, pruning is performed on the neurons with the weights smaller than the weight threshold value, so as to obtain the pruned teacher model.
Here, if the weight threshold is 0.2, the weight of the first neuron in the third-layer network (0.1) may be set to 0, that is, the first neuron in the third-layer network is deleted, while the second neuron in the third-layer network (0.3) is retained, giving the pruned teacher model. The weights of the pruned teacher model are stored as compressed sparse rows (Compressed Sparse Row, CSR) or compressed sparse columns (Compressed Sparse Column, CSC). For further compression, the absolute positions in the compressed sparse rows or columns may be replaced by index differences, with the index differences encoded in 8 bits for the convolutional layers and 5 bits for the fully connected layers. Zero padding can be used when the required index difference exceeds the representable limit; for example, if the index difference is stored as a 3-bit unsigned number, a padding zero is inserted whenever the difference exceeds 8.
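A sketch of steps S3031 and S3032 under the assumption that weight_matrix[i, j] holds the weight from neuron i of the current layer to neuron j of the next layer; the 0.2 threshold mirrors the example above, and SciPy's csr_matrix stands in for the compressed-sparse-row storage.

```python
import numpy as np
from scipy.sparse import csr_matrix

def prune_neurons(weight_matrix, threshold=0.2):
    """A neuron's matched weight is the mean of its outgoing weights; neurons
    whose matched weight is below the threshold have all connections zeroed."""
    matched = np.abs(weight_matrix).mean(axis=1)    # weight matched by each neuron
    pruned = weight_matrix.copy()
    pruned[matched < threshold, :] = 0.0            # delete the pruned neurons
    return csr_matrix(pruned)                       # compressed-sparse-row storage

layer = np.array([[0.1, 0.1], [0.3, 0.3]])          # first neuron pruned, second kept
print(prune_neurons(layer).toarray())
```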
In an embodiment of the disclosure, the weights matched by each neuron in the teacher model are determined; pruning treatment is carried out on neurons with the weight smaller than the weight threshold value, so that a pruned teacher model can be quickly and accurately obtained; the weight of the teacher model after pruning is stored in a mode of compressing sparse rows or compressed sparse columns.
In some embodiments, the step S304 may include the following steps S3041 to S3043:
and step S3041, reducing the precision of the weight to obtain the teacher model with the precision adjusted.
For example: the weight is 32-bit floating point number, and the teacher model weight is converted from 32-bit floating point number to 8-bit signed number; or deleting decimal places of the weights, determining integer weights corresponding to the weights, and the like.
And step S3042, classifying the weights of the teacher model after the accuracy adjustment by using a mean value clustering method to obtain weights of at least two groups of clusters, and identifying the sharing weight of the weights of each group of clusters.
Here, the mean clustering method (K-Means) is used to divide the ownership into K clusters, so that the intra-cluster distance of each cluster is small and the inter-cluster distance is large. For example: a network layer has 4 input neurons, and the weight matrix is 4×4, including a weight matrix and a gradient matrix. The weight matrix is clustered into 4 clusters, all weights in the same cluster share the same value, so for each weight only one small index needs to be stored into the shared weight table, so that the number of weight stores is reduced. When the weight is updated, all gradients can be grouped and summarized together according to clusters, the preset learning rate is multiplied, and the product is subtracted from the shared centroid of the last iteration to obtain an updated shared weight table and the like. For example: classifying the weights of the teacher model after the accuracy adjustment by using a mean value clustering method to obtain three groups of clustered weights, wherein the shared weight of the weights of the first group of clusters is (0.7,0.5), the shared weight of the weights of the second group of clusters is (0.4,0.6), the shared weight of the weights of the third group of clusters is (0.1, 0.3), and the like.
In some embodiments, K-Means clustering is used to identify the shared weights of each layer of the trained model, so that all weights belonging to the same cluster share the same weight; weights are not shared across layers. Here, the n original weights $W = \{w_1, w_2, \ldots, w_n\}$ are divided into $k$ clusters $C = \{c_1, c_2, \ldots, c_k\}$, with $n \gg k$, so as to minimize the within-cluster sum of squares:

$$t = \sum_{i=1}^{k} \sum_{w \in c_i} \left| w - c_i \right|^2 \tag{1}$$

In formula (1), $t$ represents the within-cluster sum of squares, $w$ represents an original weight, and $c_i$ represents the clustered (shared) weight of the $i$-th cluster.
And step S3043, replacing the weight of the same cluster in the teacher model after the precision adjustment with the corresponding shared weight to obtain the quantized teacher model.
For example: the sharing weight corresponding to the first neuron and the second neuron in the third layer network is 0.7, and the sharing weight corresponding to the third neuron and the fourth neuron in the third layer network is 0.5.
In the embodiment of the disclosure, the teacher model with the adjusted precision is obtained by reducing the precision of the weight; classifying the weights of the teacher model with the adjusted accuracy by using a mean value clustering method to obtain weights of at least two groups of clusters, and identifying the sharing weight of the weights of each group of clusters; the weights of the same cluster in the teacher model after the accuracy adjustment are replaced by the corresponding shared weights, so that the quantized teacher model can be obtained quickly and accurately.
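An illustrative weight-sharing sketch using scikit-learn's KMeans: the weights of a layer are clustered, each weight is replaced by its cluster centroid, and only the small per-weight index plus the shared table need to be stored; k = 4 mirrors the 4x4 example above.

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(weight_matrix, k=4):
    """Cluster a layer's weights into k groups and replace every weight by its
    cluster centroid (the shared weight of its cluster)."""
    flat = weight_matrix.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    shared_table = km.cluster_centers_.ravel()           # shared weight per cluster
    indices = km.labels_.reshape(weight_matrix.shape)    # small index stored per weight
    return shared_table[indices], shared_table, indices  # quantized layer, table, indices

layer = np.random.randn(4, 4).astype(np.float32)         # the 4x4 example layer
quantized_layer, table, idx = share_weights(layer, k=4)
```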
In some embodiments, the step S3041 may include the following steps S311 to S312:
step S311, determining the number of bits used to characterize the number of bits of the weight.
For example: the weight of the current neuron is read as 32-bit floating point bit number. Converting the teacher model weight from a 32-bit floating point bit number to an 8-bit signed bit number, etc.
And step S312, when the bit number of the bit numbers is larger than the bit number threshold value, reducing the bit number of the bit numbers used for representing the weights, and obtaining the teacher model after the precision adjustment.
For example: and the preset bit number threshold value is 16, when the weight of the current neuron is 32-bit floating point bit number, the weight of the teacher model is converted from 32-bit floating point bit number to 8-bit signed bit number and the like, and the teacher model with the adjusted precision is obtained.
In some embodiments, to calculate the compression rate, given k clusters, only $\log_2(k)$ bits are needed to encode each index. In general, for a network with n connections, each originally represented by b bits, limiting the connections to only k shared weights gives the compression rate:

$$r = \frac{nb}{n \log_2(k) + kb} \tag{2}$$

In formula (2), $r$ represents the compression rate, $k$ represents the number of clusters, $\log_2(k)$ represents the number of bits of the coding index, $n$ represents the number of connections in the network, and $b$ represents the number of bits originally used for each connection.
In the embodiments of the present disclosure, by determining the number of bits used to represent the weights and reducing that number when it is larger than the bit-number threshold, the precision-adjusted teacher model can be obtained quickly and accurately.
The embodiment of the present disclosure provides a fourth model training method, as shown in fig. 4, including the following steps S401 to S407:
steps S401 to S403 correspond to steps S201 to S203, respectively, and reference may be made to the specific embodiments of steps S201 to S203.
And step S404, predicting the test samples in the test set by using the teacher model to obtain a first prediction result of each test sample.
Here, in the case where the distillation temperature is set to the first temperature, each test sample in the test set may be input to the teacher model, resulting in a first prediction result for each test sample; wherein the first temperature is not equal to 1. For example: the test sample is an animal image and the first predictive result is a horse.
And step S405, predicting the test samples in the test set by using the first student model to obtain a second prediction result of each test sample.
Here, in the case where the distillation temperature is set to the first temperature, each test sample in the test set may be input to the first student model, resulting in a second prediction result for each test sample. For example: the test sample is an animal image and the second predicted result is a horse with the distillation temperature set to the first temperature. Meanwhile, under the condition that the distillation temperature is set to be 1, each test sample in the test set can be input into the first student model, and a second prediction result of each test sample is obtained. For example: the test sample is an animal image and the second predicted result is donkey with the distillation temperature set to 1.
Step S406, determining a distillation loss of the first student model based on the first prediction result and the second prediction result.
Here, a first characteristic distance between a first prediction result in the case where the distillation temperature is set to the first temperature and a second prediction result in the case where the distillation temperature is set to the first temperature may be determined, and the first characteristic distance may be determined as the first loss; determining a second characteristic distance between a second prediction result and labeling information of the test sample in the case that the distillation temperature is set to 1, and determining the second characteristic distance as a second loss; a weighted sum between the first loss and the second loss is determined, and the weighted sum is determined as a distillation loss for the first student model. Wherein the distillation loss may be a conventional distillation loss and a relational distillation loss, which may include a distance distillation loss, an angle distillation loss, and the like.
In some embodiments, the conventional (individual) distillation loss may be determined using the following equation:

$$\mathcal{L}_{IKD} = \sum_{x_i \in \mathcal{X}} l\left(f_T(x_i),\, f_S(x_i)\right) \tag{3}$$

In formula (3), $\mathcal{L}_{IKD}$ represents the conventional distillation loss, $l$ represents a preset loss function, $f_T(x_i)$ represents a first prediction result, $f_S(x_i)$ represents a second prediction result, and $x_i$ represents a test sample.
In some embodiments, the relational distillation loss may be determined using the following equation:

$$\mathcal{L}_{RKD} = \sum_{(x_1, \ldots, x_n) \in \mathcal{X}^N} l\left(\psi(t_1, \ldots, t_n),\, \psi(s_1, \ldots, s_n)\right) \tag{4}$$

In formula (4), $\mathcal{L}_{RKD}$ represents the relational distillation loss, $t_i = f_T(x_i)$ and $s_i = f_S(x_i)$ represent the first prediction result and the second prediction result, and $\psi$ represents a relational potential function that measures the preset relational energy of an n-tuple.
In some embodiments, the relational potential function may include a distance potential function and an angle potential function. The distance potential function may be determined using the following equation:

$$\psi_D(t_i, t_j) = \frac{1}{\mu} \left\| t_i - t_j \right\|_2 \tag{5}$$

In formula (5), $\psi_D$ represents the distance potential function, $t_i$ and $t_j$ represent the first prediction result or the second prediction result, and $\mu$ represents a normalization factor for the distance.
In some embodiments, the normalization factor for the distance may be determined using the following formula:

$$\mu = \frac{1}{\left| \mathcal{X}^2 \right|} \sum_{(x_i, x_j) \in \mathcal{X}^2} \left\| t_i - t_j \right\|_2 \tag{6}$$

In formula (6), $\mu$ represents the normalization factor for the distance, $t_i$ and $t_j$ represent the first prediction result or the second prediction result, and $\mathcal{X}^2$ represents the pairs of test samples in a mini-batch of the test set. Since knowledge distillation attempts to match the distance potentials between the teacher model and the student model, this mini-batch distance normalization is very useful, especially when the teacher distances $\|t_i - t_j\|_2$ and the student distances $\|s_i - s_j\|_2$ differ significantly in scale because the output dimensions differ; normalization also provides more stable and faster convergence in training.
In some embodiments, the distance distillation loss may be determined using the following equation:

$$\mathcal{L}_{RKD\text{-}D} = \sum_{(x_i, x_j) \in \mathcal{X}^2} l_\delta\left(\psi_D(t_i, t_j),\, \psi_D(s_i, s_j)\right) \tag{7}$$

In formula (7), $\mathcal{L}_{RKD\text{-}D}$ represents the distance distillation loss and $l_\delta$ represents the Huber loss.
In some embodiments, the Huber loss may be determined using the following equation:

$$l_\delta(x, y) = \begin{cases} \dfrac{1}{2}(x - y)^2, & \left| x - y \right| \le 1 \\ \left| x - y \right| - \dfrac{1}{2}, & \text{otherwise} \end{cases} \tag{8}$$

In formula (8), $l_\delta(x, y)$ represents the Huber loss and $x$, $y$ represent the input variables.
In some embodiments, the angle distillation loss may be determined using the following equation:

$$\mathcal{L}_{RKD\text{-}A} = \sum_{(x_i, x_j, x_k) \in \mathcal{X}^3} l_\delta\left(\psi_A(t_i, t_j, t_k),\, \psi_A(s_i, s_j, s_k)\right) \tag{9}$$

In formula (9), $\mathcal{L}_{RKD\text{-}A}$ represents the angle distillation loss, $l_\delta$ represents the Huber loss, and $\psi_A$ represents the angle potential function.
In some embodiments, the total loss of the first student model may comprise a plurality of distillation losses, where the relational distillation loss may be used alone or in combination with a task-specific loss function, and the total loss may be determined using the following formula:

$$\mathcal{L} = \mathcal{L}_{task} + \lambda_{KD} \cdot \mathcal{L}_{KD} \tag{10}$$

In formula (10), $\mathcal{L}$ represents the total loss; $\mathcal{L}_{task}$ represents the preset task-specific loss (the teacher model and the student model may handle different tasks, and different tasks correspond to different task-specific losses); $\mathcal{L}_{KD}$ represents the conventional distillation loss and/or the relational distillation loss, and so on; and $\lambda_{KD}$ represents an adjustable hyperparameter that balances the loss terms.
And step S407, updating the weight of the first student model based on the distillation loss to obtain the second student model.
Here, the current gradient corresponding to the current distillation weight may be determined; the product of the current gradient and a preset current step size is computed and taken as the current adjustment amount of the weights of the first student model; and the difference between the current weight of the first student model and the current adjustment amount is computed and taken as the new current weight of the first student model, and so on.
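A minimal manual-update sketch of step S407 in PyTorch: the adjustment of each weight is its current gradient times a preset step, subtracted from the current weight; the step size is illustrative.

```python
import torch

def update_student_weights(student_model, distillation_loss, step=0.01):
    """Update the first student model's weights from the distillation loss."""
    distillation_loss.backward()              # current gradient of every weight
    with torch.no_grad():
        for p in student_model.parameters():
            if p.grad is not None:
                p -= step * p.grad            # weight minus its adjustment amount
                p.grad = None
    return student_model
```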
In the embodiments of the disclosure, the test samples in the test set are predicted with the teacher model to obtain a first prediction result for each test sample; the test samples are predicted with the first student model to obtain a second prediction result for each test sample; and the distillation loss of the first student model is determined based on the first and second prediction results, where the distillation loss is a distance distillation loss or an angle distillation loss (whereas in the related art the weights of the student model are updated with only the conventional distillation loss). Updating the weights of the first student model based on this distillation loss yields the second student model quickly and accurately and improves the training efficiency of the first student model, among other benefits.
In some embodiments, the step S406 may include the following steps S4061 to S4063:
step S4061, determining a first euclidean distance between the first prediction results and a second euclidean distance between the second prediction results corresponding to the first prediction results.
Here, a Batch-sized relational structure output feature may be obtained by determining a feature distance between every two test samples input into each Batch (Batch) in the first student model. The feature distance may refer to a first euclidean distance between features corresponding to the test sample. Reasoning each test sample by using a first student model to obtain a first prediction result corresponding to each test sample; feature distances, such as a second Euclidean distance, between the prediction results of every two batches output by the first student model are determined.
Step S4062, determining Hu Ba loss between the first euclidean distance and the second euclidean distance.
Here, the first euclidean distance and the second euclidean distance have a corresponding relationship, for example, the first euclidean distance corresponds to the first test sample and the second test sample, and the second euclidean distance corresponds to the prediction result of the first test sample and the prediction result of the second test sample. A Hu Ba (Huber) penalty between the matching first and second euclidean distances may be determined.
Step S4063, determining the sum of all the Huber losses as the distance distillation loss.
Here, the Huber losses between all matched first and second Euclidean distances may be summed to obtain the distance distillation loss.
In some embodiments, a first angle between any three test samples in each batch input into the first student model may also be determined; each test sample is inferred with the first student model to obtain its prediction result; a second angle between the corresponding three prediction results in each batch output by the first student model is determined; the Huber loss between each matched first angle and second angle is determined; and all these Huber losses are summed to obtain the angle distillation loss, and so on.
In the embodiments of the disclosure, the first Euclidean distances between the first prediction results and the corresponding second Euclidean distances between the second prediction results are determined; the Huber loss between each first and second Euclidean distance is determined; and the sum of all the Huber losses is taken as the distance distillation loss, so that the distance distillation loss can be obtained quickly and accurately, improving the training efficiency of the first student model.
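A hedged PyTorch sketch of the distance distillation loss of steps S4061 to S4063: pairwise Euclidean distances within the teacher batch and the student batch are normalized by their mean and compared with the Huber loss (smooth_l1_loss); the feature shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def distance_distillation_loss(teacher_feats, student_feats):
    """Compare normalized pairwise distances of teacher and student outputs."""
    def normalized_pdist(feats):
        d = torch.cdist(feats, feats, p=2)           # first/second Euclidean distances
        positive = d[d > 0]
        mu = positive.mean() if positive.numel() else d.new_ones(())
        return d / mu                                 # distance potential, formula (5)
    t_d = normalized_pdist(teacher_feats)
    s_d = normalized_pdist(student_feats)
    return F.smooth_l1_loss(s_d, t_d)                 # Huber loss averaged over pairs

teacher_out = torch.randn(8, 128)                     # a batch of teacher predictions
student_out = torch.randn(8, 64)                      # student outputs may differ in dimension
loss = distance_distillation_loss(teacher_out, student_out)
```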
In some embodiments, the step S407 may include the following steps S4071 to S4072:
in step S4071, in a case where a difference between the prediction accuracy of the first student model and the prediction accuracy of the teacher model is greater than or equal to the accuracy threshold, an adjustment gradient determined by the distillation loss is acquired.
For example: and if the precision threshold value is 0.1 and the current difference value between the prediction precision of the first student model and the prediction precision of the teacher model is 0.2, continuing to adjust the weight of the first student model. For example: and determining an adjustment gradient corresponding to the distillation loss by using a gradient descent method. If it is determined that the current difference between the prediction accuracy of the first student model and the prediction accuracy of the teacher model is 0.05, the adjustment of the weight of the first student model may be stopped, and the current first student model is determined as the second student model.
Step S4072, updating the weight of the first student model based on the adjustment gradient and a preset adjustment step length to obtain the second student model.
Here, the preset adjustment step length may be determined from a preset learning rate and the number of weight adjustments already performed. The product of the adjustment gradient and the adjustment step length is determined as the adjustment amount of the weight; the difference between the current weight and the current adjustment amount is determined as the weight of the updated first student model.
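A minimal sketch of this update, assuming the distillation loss is a differentiable PyTorch scalar; deriving the adjustment step length as base_lr / (1 + adjustment_count) is only one possible way to combine a preset learning rate with the number of adjustments:

import torch

def update_student_weights(student, distill_loss, base_lr: float = 0.01,
                           adjustment_count: int = 0) -> None:
    adjust_step = base_lr / (1 + adjustment_count)   # preset adjustment step length
    student.zero_grad()
    distill_loss.backward()                          # adjustment gradient from the loss
    with torch.no_grad():
        for weight in student.parameters():
            if weight.grad is not None:
                # adjustment amount = adjustment gradient * adjustment step length;
                # updated weight = current weight - adjustment amount.
                weight -= adjust_step * weight.grad

In practice torch.optim.SGD performs the same subtraction; the manual loop is shown only to mirror the wording of step S4072.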
In the embodiment of the disclosure, an adjustment gradient determined by distillation loss is obtained under the condition that the difference between the prediction precision of the first student model and the prediction precision of the teacher model is greater than or equal to a precision threshold; based on the adjustment gradient and the preset adjustment step length, the weight of the first student model is updated, and the second student model can be obtained rapidly and accurately.
The application of model training provided by the embodiment of the present disclosure in an actual scenario is described below, taking a model training system scenario with model compression as an example. As shown in fig. 5, the model training system may include a pruning module 501, a quantization module 502, a huffman coding module 503, a relational knowledge distillation module 504, and the like.
Pruning module 501 may prune teacher model 505 to obtain a pruned teacher model; wherein, the teacher model 505 may be obtained by training the initial model by using a training set; the quantization module 502 may perform quantization processing on the pruned teacher model to obtain a quantized teacher model; the huffman coding module 503 may perform coding conversion on the quantized teacher model to obtain a coded teacher model; the encoded teacher model may be determined as a pre-distillation student model 506 (i.e., a first student model), and based on the relational knowledge distillation module 504, the pre-distillation student model 506 may be subjected to relational knowledge distillation based on the teacher model 505 to obtain a post-distillation student model (i.e., a second student model).
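The flow of Fig. 5 can be summarized in a short orchestration sketch; the four callables stand in for modules 501 to 504 and are passed in as parameters, since this sketch only fixes the order of the compression stages rather than any concrete implementation:

def compress(teacher, train_set, test_set,
             prune, quantize, huffman_encode, relational_distill):
    pruned = prune(teacher, train_set)              # pruning module 501 (prune + retrain)
    quantized = quantize(pruned, train_set)         # quantization module 502 (codebook)
    student = huffman_encode(quantized)             # Huffman coding module 503 -> model 506
    # relational knowledge distillation module 504: distance or angle distillation
    return relational_distill(teacher, student, test_set)   # post-distillation student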
The embodiment of the present disclosure provides a fifth model training method, as shown in fig. 6, which may include the following steps S601 to S611:
step S601, a training set is acquired.
Here, the computer device may receive the labeled data set uploaded by the user, and divide the data set according to a preset allocation ratio to obtain a training set and a test set. The computer device may store the training set and the test set in a preset storage space.
Step S602, training the initial model based on the training set.
Here, the pruning module 501 may receive the initial model uploaded by the user, predict the training samples in the training set by using the initial model to obtain a prediction result, determine a loss of the initial model based on a feature distance between the prediction result and the labeling information of the training samples, and adjust parameters of the initial model based on the loss to obtain a trained initial model.
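A minimal supervised training loop for step S602, assuming a PyTorch classifier; the cross-entropy loss and SGD optimizer are assumed choices, since the embodiment only requires that a loss between the prediction results and the labeling information be minimized:

import torch
from torch import nn, optim

def train_teacher(initial_model: nn.Module, train_loader, epochs: int = 10) -> nn.Module:
    criterion = nn.CrossEntropyLoss()              # assumed loss vs. labeling information
    optimizer = optim.SGD(initial_model.parameters(), lr=0.01)
    for _ in range(epochs):
        for samples, labels in train_loader:
            predictions = initial_model(samples)   # prediction results
            loss = criterion(predictions, labels)  # loss of the initial model
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                       # adjust parameters of the initial model
    return initial_model                           # trained initial model, i.e. the teacher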
Step S603, pruning the trained initial model to obtain a pruned teacher model.
Here, the pruning module 501 may determine the trained initial model as a teacher model, and perform pruning processing on the trained initial model to obtain a pruned teacher model; and updating the weight of the pruned teacher model again based on the training set to obtain a first updated teacher model, and determining the first updated teacher model as the pruned teacher model.
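A sketch of magnitude-based pruning with a fixed weight threshold (the threshold value is illustrative); retraining on the training set afterwards to obtain the first updated teacher model is left to the training loop above:

import torch
from torch import nn

def prune_by_magnitude(model: nn.Module, weight_threshold: float = 1e-2) -> nn.Module:
    with torch.no_grad():
        for weight in model.parameters():
            mask = weight.abs() >= weight_threshold    # keep sufficiently large weights
            weight *= mask.to(weight.dtype)            # pruned connections become zero
    return model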
Step S604, quantizing the pruned teacher model to obtain a quantized teacher model.
Here, the quantization module 502 may cluster the weights into different groups, determine a shared weight for each group of clustered weights, and thereby obtain a shared weight table (which may also be referred to as a codebook); the precision of each weight is reduced by mapping it to a shared weight in the shared weight table, yielding a precision-adjusted teacher model; the weights of the precision-adjusted teacher model are then updated again based on the training set to obtain a second updated teacher model, and the second updated teacher model is determined as the quantized teacher model, and so on.
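A sketch of this weight-sharing quantization for a single weight matrix, using scikit-learn's k-means as the clustering step; the cluster count of 16 (a 4-bit codebook) is an assumption:

import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights: np.ndarray, n_clusters: int = 16):
    flat = weights.reshape(-1, 1)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    codebook = kmeans.cluster_centers_.flatten()        # shared weight table
    indices = kmeans.labels_                            # cluster index of each weight
    quantized = codebook[indices].reshape(weights.shape)
    return quantized, codebook, indices.reshape(weights.shape)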
Step S605, encoding the quantized teacher model to obtain an encoded teacher model.
Here, the Huffman coding module 503 may perform Huffman coding on the weights themselves, or on the indexes of the weights in the shared weight table, to obtain the encoded teacher model.
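A sketch of building the Huffman code table over the quantized weight indices, using only the Python standard library; bit-packing of the encoded stream and the single-symbol edge case are omitted:

import heapq
from collections import Counter

def huffman_code_table(indices):
    # indices: flat iterable of codebook indices; returns {index: bit string}.
    heap = [[freq, order, {symbol: ""}]
            for order, (symbol, freq) in enumerate(Counter(indices).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        low = heapq.heappop(heap)
        high = heapq.heappop(heap)
        # Combine the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + code for s, code in low[2].items()}
        merged.update({s: "1" + code for s, code in high[2].items()})
        heapq.heappush(heap, [low[0] + high[0], tie, merged])
        tie += 1
    return heap[0][2]   # frequent indices receive the shortest codes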
In step S606, a teacher model is determined.
Here, the relational knowledge distillation module 504 may determine the trained initial model as a teacher model.
In step S607, a test set is acquired.
Here, the test samples in the test set have the same scene as the training samples in the training set. The relational knowledge distillation module 504 may obtain a test set from a storage space used by a computer device to store the data set.
Step S608, a first student model is determined.
Here, the relational knowledge distillation module 504 may determine the teacher model after the encoding process as the first student model.
Step S609, based on the test set, performing relational knowledge distillation on the first student model by using the teacher model.
Here, the relational knowledge distillation module 504 may determine a distance distillation loss or an angle distillation loss of the first student model, adjust parameters of the first student model based on the distance distillation loss or the angle distillation loss, and obtain the second student model.
Step S610, a second student model is determined.
Here, the relational knowledge distillation module 504 may determine the first student model after relational knowledge distillation as the second student model.
In the embodiment of the disclosure, to address problems such as difficulty of model deployment and heavy occupation of hardware resources, the teacher model is compressed by a model compression algorithm based on pruning, quantization, coding, and relational knowledge distillation. The student model obtained after relational knowledge distillation can retain the high accuracy of the teacher model, or even exceed it, so the model can be compressed several-fold with little impact on accuracy, effectively saving storage resources. Meanwhile, in the embodiment of the disclosure, knowledge extraction is performed by relational knowledge distillation, a method different from traditional knowledge distillation: the first student model is updated through the distance distillation loss or the angle distillation loss, which helps the first student model converge stably and quickly to obtain the second student model, and through this knowledge distillation the prediction accuracy of the second student model can be higher than that of the first student model.
Based on the foregoing embodiments, the embodiments of the present disclosure provide a model training apparatus, which includes the units described below and the modules included in those units, and may be implemented by a processor in a computer device; of course, it may also be implemented by a specific logic circuit. In practice, the processor may be a central processing unit (Central Processing Unit, CPU), a microprocessor (Microprocessor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like.
Fig. 7 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the model training apparatus 700 includes: an acquisition module 710, a training module 720, and a compression module 730, wherein:
the acquisition module 710 is configured to acquire a training set, a test set, and an initial model; the training module 720 is configured to train the initial model by using the training set to obtain a teacher model; the compression module 730 is configured to perform compression processing on the teacher model at least twice based on the test set to obtain a student model; wherein a difference between the prediction accuracy of the student model and the prediction accuracy of the teacher model is less than an accuracy threshold; the student model is used for deployment to terminal equipment whose hardware resources meet preset conditions, to predict data to be processed; and the compression processing includes at least one of: pruning, quantization, coding, and knowledge distillation.
In some embodiments, the student model includes a first student model and a second student model; the compression module is further configured to: compressing the data volume of the teacher model by adopting a preset compression mode to obtain the first student model; wherein the compression mode comprises at least one of the following: pruning, quantizing and encoding; based on the test set, performing knowledge distillation on the first student model by using the teacher model to obtain the second student model; and the prediction precision of the second student model is larger than that of the first student model.
In some embodiments, the compression module is further configured to: pruning is carried out on the model structure or model parameters of the teacher model, so that a pruned teacher model is obtained; carrying out quantization processing on the weight of the pruned teacher model to obtain a quantized teacher model; and carrying out coding processing on the quantized weights of the teacher model to obtain the first student model.
In some embodiments, the teacher model is a neural network model; the compression module is further configured to: determining a weight that each neuron in the teacher model matches; pruning is carried out on the neurons with the weight smaller than a weight threshold value, and a teacher model after pruning is obtained; the weight of the pruned teacher model is stored in a compressed sparse row or compressed sparse column mode.
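The storage of the pruned, mostly-zero weight matrix in compressed sparse row form can be sketched with SciPy; the layer shape and surviving density below are purely illustrative:

import numpy as np
from scipy import sparse

pruned_weights = np.zeros((512, 512), dtype=np.float32)   # hypothetical pruned layer
pruned_weights[::10, ::10] = 0.3                          # roughly 1% of connections survive

csr = sparse.csr_matrix(pruned_weights)                   # compressed sparse row storage
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(sparse_bytes, pruned_weights.nbytes)                # sparse vs. dense footprint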
In some embodiments, the compression module is further configured to: reducing the precision of the weight to obtain a teacher model with the precision adjusted; classifying the weights of the teacher model subjected to the precision adjustment by using a mean value clustering method to obtain weights of at least two groups of clusters, and identifying the sharing weight of the weights of each group of clusters; and replacing the weight of the same cluster in the teacher model after the precision adjustment with the corresponding shared weight to obtain the quantized teacher model.
In some embodiments, the compression module is further configured to: predicting the test samples in the test set by using the teacher model to obtain a first prediction result of each test sample; predicting the test samples in the test set by using the first student model to obtain a second prediction result of each test sample; determining a distillation loss of the first student model based on the first prediction result and the second prediction result; wherein the distillation loss is a distance distillation loss or an angle distillation loss; and updating the weight of the first student model based on the distillation loss to obtain the second student model.
In some embodiments, the compression module is further configured to: determining the number of bits used to represent the weight; and, under the condition that the number of bits is larger than a bit number threshold, reducing the number of bits used to represent the weight to obtain the teacher model after the precision adjustment.
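A sketch of this bit-number check, assuming NumPy weights; the 16-bit threshold and the float16 target are illustrative assumptions:

import numpy as np

def reduce_weight_precision(weights: np.ndarray, bit_threshold: int = 16) -> np.ndarray:
    bits_per_weight = weights.dtype.itemsize * 8      # bits used to represent one weight
    if bits_per_weight > bit_threshold:
        return weights.astype(np.float16)             # e.g. float32 -> float16
    return weights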
In some embodiments, the compression module is further configured to: determining a first Euclidean distance between the first prediction results and a second Euclidean distance between the second prediction results corresponding to the first prediction results; determining the Huber loss between the first Euclidean distance and the second Euclidean distance; and determining the sum of all the Huber losses as the distance distillation loss.
In some embodiments, the compression module is further configured to: obtaining an adjustment gradient determined by the distillation loss under the condition that the difference value between the prediction precision of the first student model and the prediction precision of the teacher model is greater than or equal to the precision threshold; and updating the weight of the first student model based on the adjustment gradient and a preset adjustment step length to obtain the second student model.
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. In some embodiments, functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to perform the methods described in the embodiments of the method, and for technical details not disclosed in the embodiments of the apparatus of the present disclosure, please understand with reference to the description of the embodiments of the method of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, if the model training method is implemented in the form of a software functional module and is sold or used as a separate product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure, in essence or the part contributing to the related art, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code. Thus, embodiments of the present disclosure are not limited to any specific hardware, software, or firmware, or any combination of the three.
The disclosed embodiments provide a computer device comprising a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements some or all of the steps of the above method when executing the program.
The disclosed embodiments provide a computer readable storage medium having stored thereon a computer program which when executed by a processor performs some or all of the steps of the above method. The computer readable storage medium may be transitory or non-transitory.
The disclosed embodiments provide a computer program comprising computer readable code which, when run in a computer device, performs some or all of the steps for implementing the methods described above.
Embodiments of the present disclosure provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the above-described method. The computer program product may be implemented in hardware, software, or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium, and in other embodiments the computer program product is embodied as a software product, such as a Software Development Kit (SDK), or the like.
It should be noted here that: the above description of various embodiments is intended to emphasize the differences between the various embodiments, the same or similar features being referred to each other. The above description of apparatus, storage medium, computer program and computer program product embodiments is similar to that of method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the disclosed apparatus, storage medium, computer program and computer program product, please refer to the description of the embodiments of the disclosed method.
It should be noted that, fig. 8 is a schematic diagram of a hardware entity of a computer device in an embodiment of the disclosure, as shown in fig. 8, the hardware entity of the computer device 800 includes: a processor 801, a communication interface 802, and a memory 803, wherein:
the processor 801 generally controls the overall operation of the computer device 800.
The communication interface 802 may enable the computer device to communicate with other terminals or servers over a network.
The memory 803 is configured to store instructions and applications executable by the processor 801, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 801 and by various modules in the computer device 800; it may be implemented by a flash memory (FLASH) or a random access memory (Random Access Memory, RAM). Data may be transferred between the processor 801, the communication interface 802, and the memory 803 via the bus 804.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the size of the sequence numbers of the steps/processes described above does not mean the order of execution, and the order of execution of the steps/processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The foregoing embodiment numbers of the present disclosure are merely for description and do not represent advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the above-described integrated units of the present disclosure may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the present disclosure may be embodied essentially or in part in a form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The methods disclosed in the several method embodiments provided in the present disclosure may be arbitrarily combined without collision to obtain a new method embodiment.
If the embodiment of the disclosure relates to personal information, the product applying the embodiment of the disclosure clearly informs the personal information processing rule and obtains personal autonomous consent before processing the personal information. If the disclosed embodiments relate to sensitive personal information, the product to which the disclosed embodiments are applied has obtained individual consent before processing the sensitive personal information, and at the same time meets the requirement of "explicit consent".
The foregoing is merely an embodiment of the present disclosure, but the protection scope of the present disclosure is not limited thereto, and any person skilled in the art can easily think about the changes or substitutions within the technical scope of the present disclosure, and should be covered by the protection scope of the present disclosure.

Claims (10)

1. A method of model training, comprising:
acquiring a training set, a testing set and an initial model;
training the initial model by using the training set to obtain a teacher model;
based on the test set, compressing the teacher model at least twice to obtain a student model;
Wherein a difference between the prediction accuracy of the student model and the prediction accuracy of the teacher model is less than an accuracy threshold; the student model is used for deploying to terminal equipment with hardware resources meeting preset conditions, and predicting data to be processed; the compression process includes at least one of: pruning, quantization, coding, knowledge distillation.
2. The method of claim 1, wherein the student model comprises a first student model and a second student model; based on the test set, compressing the teacher model at least twice to obtain a student model, including:
compressing the data volume of the teacher model by adopting a preset compression mode to obtain the first student model; wherein the compression mode comprises at least one of the following: pruning, quantizing and encoding;
based on the test set, performing knowledge distillation on the first student model by using the teacher model to obtain the second student model;
and the prediction precision of the second student model is larger than that of the first student model.
3. The method of claim 2, wherein compressing the data volume of the teacher model by a preset compression method to obtain the first student model includes:
Pruning is carried out on the model structure or model parameters of the teacher model, so that a pruned teacher model is obtained;
carrying out quantization processing on the weight of the pruned teacher model to obtain a quantized teacher model;
and carrying out coding processing on the quantized weights of the teacher model to obtain the first student model.
4. A method according to claim 3, wherein the teacher model is a neural network model; pruning is carried out on the model structure or the model parameters of the teacher model to obtain a pruned teacher model, and the method comprises the following steps:
determining a weight that each neuron in the teacher model matches;
pruning is carried out on the neurons with the weight smaller than a weight threshold value, and a teacher model after pruning is obtained;
the weight of the pruned teacher model is stored in a compressed sparse row or compressed sparse column mode.
5. The method of claim 3, wherein the quantifying the weight of the pruned teacher model to obtain a quantized teacher model comprises:
reducing the precision of the weight to obtain a teacher model with the precision adjusted;
Classifying the weights of the teacher model subjected to the precision adjustment by using a mean value clustering method to obtain weights of at least two groups of clusters, and identifying the sharing weight of the weights of each group of clusters;
and replacing the weight of the same cluster in the teacher model after the precision adjustment with the corresponding shared weight to obtain the quantized teacher model.
6. The method according to any one of claims 2 to 5, wherein performing knowledge distillation on the first student model using the teacher model based on the test set to obtain the second student model comprises:
predicting the test samples in the test set by using the teacher model to obtain a first prediction result of each test sample;
predicting the test samples in the test set by using the first student model to obtain a second prediction result of each test sample;
determining a distillation loss of the first student model based on the first prediction result and the second prediction result; wherein the distillation loss is a distance distillation loss or an angle distillation loss;
and updating the weight of the first student model based on the distillation loss to obtain the second student model.
7. The method of claim 5, wherein the reducing the accuracy of the weights to obtain the accuracy-adjusted teacher model comprises:
determining the number of bits used to represent the weight;
and under the condition that the number of bits is larger than a bit number threshold value, reducing the number of bits used for representing the weight to obtain the teacher model after the precision adjustment.
8. A model training device, comprising:
the acquisition module is used for acquiring a training set, a testing set and an initial model;
the training module is used for training the initial model by utilizing the training set to obtain a teacher model;
the compression module is used for compressing the teacher model at least twice based on the test set to obtain a student model; wherein a difference between the prediction accuracy of the student model and the prediction accuracy of the teacher model is less than an accuracy threshold; the student model is used for deploying to terminal equipment with hardware resources meeting preset conditions, and predicting data to be processed; the compression process includes at least one of: pruning, quantization, coding, knowledge distillation.
9. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202211289380.6A 2022-10-20 2022-10-20 Model training method, device, equipment and storage medium Pending CN116976428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211289380.6A CN116976428A (en) 2022-10-20 2022-10-20 Model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211289380.6A CN116976428A (en) 2022-10-20 2022-10-20 Model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116976428A true CN116976428A (en) 2023-10-31

Family

ID=88481949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211289380.6A Pending CN116976428A (en) 2022-10-20 2022-10-20 Model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116976428A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474037A (en) * 2023-12-25 2024-01-30 深圳须弥云图空间科技有限公司 Knowledge distillation method and device based on space distance alignment
CN117474037B (en) * 2023-12-25 2024-05-10 深圳须弥云图空间科技有限公司 Knowledge distillation method and device based on space distance alignment
CN117912484A (en) * 2024-03-20 2024-04-19 北京建筑大学 Pruning-adjustable audio separation model optimization method and device
CN117912484B (en) * 2024-03-20 2024-05-17 北京建筑大学 Pruning-adjustable audio separation model optimization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination