Disclosure of Invention
In view of this, the present application provides a training method, a training device, a training network, and a device terminal for an image recognition model. The method builds on existing convolutional neural network structures while further drawing on the advantages of ViT networks, combining the two so as to overcome the drawback that proven methods in the field of computer vision cannot be directly combined with the newer vision approach based on ViT networks.
A training method of an image recognition model, comprising:
performing feature extraction on an input training image dataset through a convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
acquiring feature maps output by a plurality of intermediate layers in the convolutional neural network;
respectively inputting the feature map output by each intermediate layer into the corresponding preset ViT network for feature extraction, so as to obtain a predicted label value and a first preset loss function value corresponding to each preset ViT network;
updating the weights and biases of each preset ViT network according to its corresponding first preset loss function value;
calculating an integrated predicted label value from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
calculating a second preset loss function value corresponding to the convolutional neural network from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value;
updating the weights and biases of the convolutional neural network according to the second preset loss function value;
and executing the above steps cyclically until the second preset loss function converges, so as to generate a corresponding image recognition model.
In one embodiment, the intermediate layers are pooling layers, and the step of respectively inputting the feature maps output by the intermediate layers into the corresponding preset ViT networks for feature extraction to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network includes:
respectively inputting the feature maps output by the pooling layers into the corresponding preset ViT networks for feature extraction, so as to obtain the predicted label value corresponding to each preset ViT network;
and calculating the first preset loss function value corresponding to each preset ViT network from the first preset loss function, the predicted label value corresponding to that preset ViT network, and the true label value.
In one embodiment, before the step of performing feature extraction on the input training image dataset through the convolutional neural network to generate the corresponding predicted label value, the method further comprises:
based on a cross entropy loss function, inputting the training image dataset into an initial convolutional neural network for training until the cross entropy loss function converges, yielding a convolutional neural network that has converged through training.
In one embodiment, before the step of respectively inputting the feature maps output by the intermediate layers into the corresponding preset ViT networks for feature extraction to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network, the method further includes:
based on the first preset loss function, inputting the training image dataset into each initial ViT network for training until the corresponding first preset loss function converges, yielding preset ViT networks that have converged through training.
In one embodiment, the step of calculating the integrated predicted label value from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks includes:
weighting the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks to calculate the integrated predicted label value.
In addition, a training network for an image recognition model is also provided, comprising:
a convolutional neural network processing unit, configured to perform feature extraction on an input training image dataset through a convolutional neural network, so as to obtain a predicted label value corresponding to the convolutional neural network;
a ViT network processing unit, connected to the outputs of a plurality of intermediate layers in the convolutional neural network processing unit and configured to respectively input the feature maps output by the intermediate layers into the corresponding preset ViT networks for feature extraction, so as to obtain the predicted label values and the first preset loss function values corresponding to the preset ViT networks.
The ViT network processing unit is further configured to update the weights and biases of each preset ViT network according to its corresponding first preset loss function value.
The convolutional neural network processing unit is further configured to calculate an integrated predicted label value from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks; to calculate a second preset loss function value corresponding to the convolutional neural network from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value; and to update the weights and biases of the convolutional neural network according to the second preset loss function value until the second preset loss function converges, so as to generate the corresponding image recognition model.
In addition, a training device for an image recognition model is also provided, comprising:
a label value generating unit, configured to perform feature extraction on an input training image dataset through a convolutional neural network, so as to obtain a predicted label value corresponding to the convolutional neural network;
a feature map obtaining unit, configured to obtain the feature maps output by a plurality of intermediate layers in the convolutional neural network;
a ViT network feature extraction unit, configured to respectively input the feature maps output by the intermediate layers into the corresponding preset ViT networks for feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network;
a first updating unit, configured to update the weights and biases of each preset ViT network according to its corresponding first preset loss function value;
a label value integration unit, configured to calculate an integrated predicted label value from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
a loss function value generating unit, configured to calculate a second preset loss function value corresponding to the convolutional neural network from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value;
a second updating unit, configured to update the weights and biases of the convolutional neural network according to the second preset loss function value;
and a model generating unit, configured to generate the corresponding image recognition model when the second preset loss function converges.
In addition, an image recognition method is also provided, in which image recognition is performed using an image recognition model obtained by the training method described above.
Furthermore, there is provided a device terminal comprising a processor and a memory, the memory storing a computer program, and the processor running the computer program to cause the device terminal to perform the training method described above.
Further, a readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the training method described above.
According to the above training method of an image recognition model, the feature maps output by the convolutional neural network are input to the preset ViT networks, so that the weights of each preset ViT network are optimized directly at the feature-map level. Meanwhile, an integrated predicted label value is calculated from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks; a second preset loss function value corresponding to the convolutional neural network is calculated from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value; and the weights and biases of the convolutional neural network are updated according to the second preset loss function value. These steps are executed cyclically, and the corresponding image recognition model is generated when the second preset loss function finally converges. In this way, the preset ViT networks are fully used to train the image recognition model, yet the model can generate recognition predictions without depending on them, so it adapts well to various inference platforms. The many proven convolutional network structures in the field of computer vision, together with their corresponding training methods, can still be exploited while being effectively combined with the novel ViT visual model, overcoming the drawback that the two could not previously be combined directly.
Detailed Description
The embodiments of the present application are described in detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application. The embodiments described below and their technical features may be combined with each other provided there is no conflict.
The ViT (Vision Transformer) network model essentially divides an input image into a number of patches and marks each patch with position information (i.e., its place in a sequence). After linear projection, the patch embeddings are fed to a transformer encoder, which is essentially a multi-head self-attention mechanism: it finds the relevance of each patch to every other patch, focuses attention (i.e., assigns weight) on the patches most closely related to the image as a whole, and then integrates this information to obtain a global feature and output a predicted label.
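For illustration only, the following PyTorch sketch mirrors the pipeline just described (patch splitting, position marking, linear projection, a multi-head self-attention encoder, and a label head). All dimensions and the class MiniViT itself are hypothetical; this is not the exact network used by the application.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch: patches -> positions -> self-attention -> label."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Linear projection of flattened patches, implemented as a strided conv.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned position information (the "sequence" marking) for each patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # multi-head self-attention
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend class token, add positions
        x = self.encoder(x)                              # relate each patch to the others
        return self.head(x[:, 0])                        # global feature -> predicted label logits
```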
As shown in fig. 1, a training method of an image recognition model is provided, and the training method includes:
Step S110, feature extraction is performed on the input training image dataset through the convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network.
When the convolutional neural network processes the training image dataset, it must convolve the data repeatedly; in doing so it extracts features and finally outputs a predicted label value corresponding to the convolutional neural network.
Step S120, feature maps output by a plurality of intermediate layers in the convolutional neural network are acquired.
Since the convolutional neural network convolves the training image dataset repeatedly, it produces a number of feature maps along the way, and the feature maps output by a plurality of its intermediate layers can therefore be acquired.
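As one concrete reading of steps S110 and S120, the following PyTorch sketch uses forward hooks to collect the feature maps of chosen intermediate layers during the CNN's forward pass. The toy backbone and the choice of tapped layers are assumptions, not the application's fixed design.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(                      # stand-in CNN; any backbone works
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 56 * 56, 10),
)

feature_maps = []
def save_output(module, inputs, output):
    feature_maps.append(output)

# Tap every pooling layer (any other intermediate layer could be chosen).
for m in cnn.modules():
    if isinstance(m, nn.MaxPool2d):
        m.register_forward_hook(save_output)

images = torch.randn(8, 3, 224, 224)      # stand-in training batch
feature_maps.clear()
cnn_logits = cnn(images)                  # step S110: CNN predicted label values
# feature_maps now holds one tensor per tapped intermediate layer (step S120).
```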
Step S130, the feature maps output by the intermediate layers are respectively input into the corresponding preset ViT networks for feature extraction, so as to obtain the predicted label values and the first preset loss function values corresponding to the preset ViT networks.
The input of each preset ViT network is connected, in one-to-one correspondence, to an intermediate layer of the convolutional neural network, so that feature extraction is performed separately on the feature map output by each intermediate layer and the first preset loss function value corresponding to each preset ViT network can be obtained.
Step S140, the weights and biases of each preset ViT network are updated according to its corresponding first preset loss function value.
During training, whenever the feature maps output by the intermediate layers of the convolutional neural network are fed to the preset ViT networks for feature extraction, the weights and biases of each preset ViT network must be updated in step; that is, each preset ViT network is optimized independently during training, so that its first preset loss function value keeps decreasing the next time the intermediate feature maps are processed.
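Continuing the earlier sketches, one plausible implementation of steps S130 and S140 gives each preset ViT its own optimizer and updates it from the first preset loss computed on its feature map. Here vit_heads, feature_maps, and targets are assumed from context, and cross entropy is only one possible choice of first preset loss.

```python
import torch
import torch.nn.functional as F

# vit_heads: one preset ViT per tapped layer, each sized to accept that
# layer's feature-map shape (e.g. a MiniViT variant with a matching proj).
vit_opts = [torch.optim.SGD(v.parameters(), lr=1e-3) for v in vit_heads]

vit_logits = []
for vit, opt, fmap in zip(vit_heads, vit_opts, feature_maps):
    logits = vit(fmap.detach())                # detach: this loss must not flow into the CNN
    loss_1 = F.cross_entropy(logits, targets)  # first preset loss function value (S130)
    opt.zero_grad()
    loss_1.backward()                          # S140: update this ViT's weights and biases only
    opt.step()
    vit_logits.append(logits.detach())
```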
Step S150, an integrated predicted label value is calculated from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks.
To combine the preset ViT network models more closely with the convolutional neural network, the convolutional neural network should be able to adjust its convolution kernel parameters during training according to the predictions of the preset ViT network models, so that each convolutional layer extracts the feature maps most relevant to the prediction result and the accuracy of the output label is improved. To this end, the predicted label values corresponding to the preset ViT networks and the predicted label value corresponding to the convolutional neural network are integrated, yielding the integrated predicted label value.
Step S160, a second preset loss function value corresponding to the convolutional neural network is calculated from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value.
After the integrated predicted label value has been obtained, the second preset loss function value is calculated from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value.
Because the second preset loss function value is calculated from the integrated predicted label value, the convolutional neural network can draw on the predictive strengths of the preset ViT network models, laying the foundation for subsequently optimizing the convolutional neural network with those strengths.
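Under one plausible reading of steps S150 and S160, continuing the sketches above (cnn_logits, vit_logits, targets, and F as before): the predictions are blended as softmax probabilities with fixed weights, and the second preset loss compares the blend with the true labels. The weights below are hypothetical; the CNN's probabilities are deliberately left attached to the autograd graph so that this loss can later update the CNN, while the ViT terms stay detached.

```python
# S150: weighted integration of the CNN's and the ViTs' predicted label values.
probs = [cnn_logits.softmax(dim=1)] + [v.softmax(dim=1) for v in vit_logits]
weights = [0.5] + [0.5 / len(vit_logits)] * len(vit_logits)  # hypothetical weights summing to 1
integrated = sum(w * p for w, p in zip(weights, probs))

# S160: second preset loss between the integrated prediction and the true labels
# (negative log-likelihood of the blended probabilities).
loss_2 = F.nll_loss(integrated.clamp_min(1e-8).log(), targets)
```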
Step S170, the weights and biases of the convolutional neural network are updated according to the second preset loss function value.
After the second preset loss function value has been calculated, gradients are obtained by back propagation and the weights and biases of the convolutional neural network are updated accordingly.
Step S180, the above steps are executed cyclically until the second preset loss function converges, so as to generate the corresponding image recognition model.
Steps S110 to S170 are executed cyclically so that the convolutional neural network keeps updating its weights and biases until its second preset loss function converges, at which point the corresponding image recognition model is generated.
In this embodiment, the training image dataset may be divided into a plurality of data subsets. The convolutional neural network is first trained on one data subset with a preset learning rate, and its weight and bias parameters are updated; since its parameters have changed, the feature maps output by its intermediate layers have changed as well. The updated convolutional neural network then runs inference on the same data subset again, and steps S120 to S180 are executed to update the weights and biases of each preset ViT network and of the convolutional neural network. This process is repeated over the data subsets until the second preset loss function of the convolutional neural network converges, and the corresponding image recognition model is finally generated.
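Putting the pieces together, a minimal sketch of one full training cycle (steps S110 through S180) under all of the assumptions above might look as follows. feature_maps is the hook-filled list from the earlier sketch, and the convergence test on the second loss is left to the caller.

```python
import torch.nn.functional as F

def train_cycle(cnn, vit_heads, cnn_opt, vit_opts, loader, weights):
    for images, targets in loader:                      # e.g. batches of one data subset
        feature_maps.clear()
        cnn_logits = cnn(images)                        # S110; hooks fill feature_maps (S120)

        vit_probs = []
        for vit, opt, fmap in zip(vit_heads, vit_opts, feature_maps):
            logits = vit(fmap.detach())                 # S130
            loss_1 = F.cross_entropy(logits, targets)   # first preset loss value
            opt.zero_grad(); loss_1.backward(); opt.step()   # S140: per-ViT update
            vit_probs.append(logits.detach().softmax(dim=1))

        probs = [cnn_logits.softmax(dim=1)] + vit_probs
        integrated = sum(w * p for w, p in zip(weights, probs))         # S150
        loss_2 = F.nll_loss(integrated.clamp_min(1e-8).log(), targets)  # S160

        cnn_opt.zero_grad(); loss_2.backward(); cnn_opt.step()          # S170
    return loss_2.item()  # S180: the caller repeats until this loss converges
```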
The generated image recognition model does not need to include the preset ViT networks.
According to the training method of the image recognition model described above, the feature maps output by the convolutional neural network are input to the preset ViT networks, so that the weights of each preset ViT network are optimized directly at the feature-map level. Meanwhile, an integrated predicted label value is calculated from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks; a second preset loss function value corresponding to the convolutional neural network is calculated from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value; and the weights and biases of the convolutional neural network are updated according to the second preset loss function value. Finally, the corresponding image recognition model is generated when the second preset loss function converges. In this way, the preset ViT networks are fully used to train the image recognition model, yet the model can generate recognition predictions without depending on them, so it adapts well to various inference platforms. The many excellent convolutional network structures in the field of computer vision, together with their corresponding training methods, can still be exploited, and the traditional convolutional neural network structure is effectively combined with the novel ViT visual model through this method.
In one embodiment, the intermediate layers are pooling layers. As shown in fig. 2, step S130 includes:
Step S132, the feature maps output by the pooling layers are respectively input into the corresponding preset ViT networks for feature extraction, so as to obtain the predicted label value corresponding to each preset ViT network.
Step S134, the first preset loss function value corresponding to each preset ViT network is calculated from the first preset loss function, the predicted label value corresponding to that preset ViT network, and the true label value.
The pooling layers compress the feature maps in the convolutional neural network, and together they cover feature maps of every scale. Choosing the pooling layers as the intermediate layers therefore allows the preset ViT networks to be trained on feature maps of all scales. On this basis, the feature map output by each pooling layer is fed to its preset ViT network for feature extraction to obtain the corresponding predicted label value, and the first preset loss function value corresponding to each preset ViT network is then obtained from the first preset loss function, the predicted label values, and the corresponding true label values, laying the foundation for the subsequent training process.
In one embodiment, the pooling layer is a maximum pooling layer, and the convolutional neural network is a VGG16 backbone network.
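For this embodiment, the tapped layers would simply be the five max pooling layers of a torchvision VGG16. A brief illustrative snippet (the class count is a placeholder):

```python
import torch
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(num_classes=10)         # VGG16 backbone; class count is illustrative
feature_maps = []
pools = [m for m in vgg.features if isinstance(m, nn.MaxPool2d)]
assert len(pools) == 5                     # VGG16 has five max pooling stages
for m in pools:
    m.register_forward_hook(lambda mod, inp, out: feature_maps.append(out))
logits = vgg(torch.randn(1, 3, 224, 224))  # feature_maps gains one entry per stage
```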
In one embodiment, the second preset loss function is a cross entropy loss function, and as shown in fig. 3, before step S110 the method further includes:
Step S190, based on the cross entropy loss function, the training image dataset is input into the initial convolutional neural network for training until the cross entropy loss function converges, yielding a convolutional neural network that has converged through training.
That is, the convolutional neural network itself is typically trained separately to convergence before step S110 is performed.
In one embodiment, as shown in fig. 4, before step S130 the method further includes:
Step S200, based on the first preset loss function, the training image dataset is input into each initial ViT network for training until the corresponding first preset loss function converges, yielding preset ViT networks that have converged through training.
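A minimal sketch of these two pre-training phases (steps S190 and S200), assuming cross entropy throughout and leaving the convergence test abstract. Note that the text feeds the training images themselves to each initial ViT; how those batches are shaped to match each ViT's expected input is not fixed by the description, so the helper below is deliberately generic and hypothetical.

```python
import torch.nn.functional as F

def pretrain(model, opt, loader, converged):
    # Train until the given loss-convergence predicate is satisfied.
    while True:
        for images, targets in loader:
            loss = F.cross_entropy(model(images), targets)
            opt.zero_grad(); loss.backward(); opt.step()
        if converged(loss.item()):
            return model

# S190: pretrain(cnn, cnn_opt, loader, converged)
# S200: pretrain(vit, opt, loader, converged) for each initial ViT.
```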
In one embodiment, step S150 includes: weighting the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks to calculate the integrated predicted label value.
By calculating the integrated predicted label value, the predicted label values of the ViT networks act as part of the loss and thus influence the back propagation of the convolutional neural network, achieving an effective fusion of the traditional convolutional neural network structure with the ViT network model.
In addition, as shown in fig. 5, there is also provided a training network 210 for an image recognition model, comprising:
a convolutional neural network processing unit 220, configured to perform feature extraction on an input training image dataset through a convolutional neural network, so as to obtain a predicted label value corresponding to the convolutional neural network;
a ViT network processing unit 230, connected to the outputs of a plurality of intermediate layers in the convolutional neural network processing unit and configured to respectively input the feature maps output by the intermediate layers into the corresponding preset ViT networks for feature extraction, so as to obtain the predicted label values and the first preset loss function values corresponding to the preset ViT networks.
The ViT network processing unit 230 is further configured to update the weights and biases of each preset ViT network according to its corresponding first preset loss function value.
The convolutional neural network processing unit 220 is further configured to calculate an integrated predicted label value from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks; to calculate a second preset loss function value corresponding to the convolutional neural network from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value; and to update the weights and biases of the convolutional neural network according to the second preset loss function value until the second preset loss function converges, so as to generate the corresponding image recognition model.
In addition, as shown in fig. 6, there is also provided a training device 300 for an image recognition model, comprising:
a label value generating unit 310, configured to perform feature extraction on an input training image dataset through a convolutional neural network, so as to obtain a predicted label value corresponding to the convolutional neural network;
a feature map obtaining unit 320, configured to obtain the feature maps output by a plurality of intermediate layers in the convolutional neural network;
a ViT network feature extraction unit 330, configured to respectively input the feature maps output by the intermediate layers into the corresponding preset ViT networks for feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network;
a first updating unit 340, configured to update the weights and biases of each preset ViT network according to its corresponding first preset loss function value;
a label value integration unit 350, configured to calculate an integrated predicted label value from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
a loss function value generating unit 360, configured to calculate a second preset loss function value corresponding to the convolutional neural network from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the true label value;
a second updating unit 370, configured to update the weights and biases of the convolutional neural network according to the second preset loss function value;
and a model generating unit 380, configured to generate the corresponding image recognition model when the second preset loss function converges.
In addition, an image recognition method is also provided, in which image recognition is performed using an image recognition model obtained by the training method described above.
Furthermore, there is provided a device terminal comprising a processor and a memory, the memory storing a computer program, and the processor running the computer program to cause the device terminal to perform the training method described above.
Further, a readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the training method described above.
The division of the units in the above training device is for illustration only; in other embodiments, the training device may be divided into different units as required to complete all or part of its functions. For specific limitations on the training device, reference may be made to the limitations on the method above, which are not repeated here.
The foregoing embodiments are merely illustrative of the present application and do not limit its patent scope. Any equivalent structure or equivalent process made using the description and drawings of the present application, whether by combining technical features of the embodiments or by applying them directly or indirectly in other related technical fields, likewise falls within the scope of the present application.
In addition, in the present application, structural elements having the same or similar characteristics may be identified by the same or different reference numerals. Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
In the present application, the word "e.g." is used to mean "serving as an example, instance, or illustration". Any embodiment described as "for example" in this disclosure is not necessarily to be construed as preferred or advantageous over other embodiments. The previous description is provided to enable any person skilled in the art to make or use the present application. In the above description, various details are set forth for purposes of explanation.
It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes have not been shown in detail to avoid unnecessarily obscuring the description of the application. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.