
CN114463559B - Training method and device of image recognition model, network and image recognition method - Google Patents

Info

Publication number
CN114463559B
CN114463559B
Authority
CN
China
Prior art keywords: preset, ViT, convolutional neural network, loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210110008.8A
Other languages
Chinese (zh)
Other versions
CN114463559A (en)
Inventor
申啸尘
周有喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Core Computing Integrated Shenzhen Technology Co ltd
Original Assignee
Core Computing Integrated Shenzhen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Core Computing Integrated Shenzhen Technology Co ltd filed Critical Core Computing Integrated Shenzhen Technology Co ltd
Priority to CN202210110008.8A
Publication of CN114463559A
Application granted
Publication of CN114463559B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The training method comprises: inputting the feature maps output by a plurality of intermediate layers of a convolutional neural network into corresponding preset ViT networks for feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network; updating the weights and biases of each preset ViT network accordingly; calculating an integrated predicted label value from the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks; calculating the second preset loss function value corresponding to the convolutional neural network from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value; and generating an image recognition model, so that a conventional convolutional neural network structure can be fused with ViT networks.

Description

Training method and device of image recognition model, network and image recognition method
Technical Field
The application relates to the field of image recognition, and in particular to a training method and device for an image recognition model, a training network, an image recognition method, and a device terminal.
Background
Currently, using ViT (Vision Transformer) network models in computer vision to replace CNNs (Convolutional Neural Networks) is a hotspot of computer vision research. A ViT network model essentially uses a visual self-attention mechanism to focus on the important information in each part of a picture, and then outputs a corresponding prediction result.
Because this approach is new, it uses many special operators that are uncommon in convolutional neural networks or appear there only rarely, and such operators are often unsupported by mobile terminal devices. As a result, the proven methods in the field of computer vision cannot be directly combined with the new vision approach that adopts the ViT network model.
Disclosure of Invention
In view of this, the application provides a training method, device, network and device terminal for an image recognition model, which build on the conventional convolutional neural network structure while further drawing on the advantages of ViT networks, fusing the conventional convolutional neural network structure with ViT networks so as to overcome the defect that the existing effective methods in the field of computer vision cannot be directly combined with the new vision approach adopting a ViT network.
A training method of an image recognition model, comprising:
performing feature extraction on the input training image data set through the convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
acquiring the feature maps output by a plurality of intermediate layers in the convolutional neural network;
respectively inputting the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain a predicted label value and a first preset loss function value corresponding to each preset ViT network;
respectively updating the weights and biases of each preset ViT network according to the corresponding first preset loss function value;
calculating an integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
calculating a second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value;
updating the weights and biases of the convolutional neural network according to the second preset loss function value;
and executing the above steps cyclically until the second preset loss function converges, to generate a corresponding image recognition model.
In one embodiment, the intermediate layers are pooling layers, and the step of respectively inputting the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network, includes:
respectively inputting the feature maps output by each pooling layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value corresponding to each preset ViT network;
and calculating the first preset loss function value corresponding to each preset ViT network according to the first preset loss function, the predicted label value corresponding to each preset ViT network and the real label value.
In one embodiment, before the step of performing feature extraction on the input training image data set through the convolutional neural network to generate the corresponding predicted label value, the method further includes:
inputting the training image data set into an initial convolutional neural network for training based on the cross entropy loss function until the cross entropy loss function converges, to obtain the convolutional neural network after training convergence.
In one embodiment, before the step of respectively inputting the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network, the method further includes:
inputting the training image data set into each initial ViT network for training based on the first preset loss function until the corresponding first preset loss function converges, to obtain each preset ViT network after training convergence.
In one embodiment, the step of calculating the integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks includes:
weighting the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks to obtain the integrated predicted label value.
In addition, a training network of the image recognition model is also provided, which comprises:
The convolutional neural network processing unit is used for performing feature extraction on an input training image data set through the convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
the ViT network processing unit is connected to the output ends of a plurality of intermediate layers in the convolutional neural network processing unit, and is used for respectively inputting the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network;
the ViT network processing unit is further configured to respectively update the weights and biases of each preset ViT network according to the corresponding first preset loss function value;
the convolutional neural network processing unit is further configured to calculate the integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks, calculate the second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value, and update the weights and biases of the convolutional neural network according to the second preset loss function value until the second preset loss function converges to generate the corresponding image recognition model.
In addition, a training device for an image recognition model is also provided, including:
a label value generating unit, used for performing feature extraction on an input training image data set through the convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
a feature map acquisition unit, used for acquiring the feature maps output by a plurality of intermediate layers in the convolutional neural network;
a ViT network feature extraction unit, used for respectively inputting the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network;
a first updating unit, used for respectively updating the weights and biases of each preset ViT network according to the corresponding first preset loss function value;
a label value integration unit, used for calculating the integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
a loss function value generating unit, used for calculating the second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value;
a second updating unit, used for updating the weights and biases of the convolutional neural network according to the second preset loss function value;
and a model generating unit, used for generating the corresponding image recognition model when the second preset loss function converges.
In addition, an image recognition method is provided, which performs image recognition using the image recognition model obtained by the above training method.
Furthermore, a device terminal is provided, which comprises a processor and a memory, the memory being used to store a computer program, and the processor running the computer program to cause the device terminal to perform the training method described above.
Further, a readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the training method described above.
According to the above training method of the image recognition model, the feature maps output by the convolutional neural network are input into each preset ViT network, so that the weights of each preset ViT network are optimized directly at the feature-map level. Meanwhile, an integrated predicted label value is calculated according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks; a second preset loss function value corresponding to the convolutional neural network is calculated according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value; and the weights and biases of the convolutional neural network are updated according to the second preset loss function value. These steps are executed cyclically, and a corresponding image recognition model is finally generated when the second preset loss function converges. In this way, the image recognition model is fully trained with the help of the preset ViT networks, yet it can produce recognition predictions without depending on them, so it adapts well to the various inference platforms; the many excellent convolutional network structures in the existing computer vision field, and their corresponding training methods, can still be used; and the conventional convolutional neural network structure is effectively fused with the ViT network model, overcoming the defect that the existing effective methods in the field of computer vision cannot be directly combined with the new vision approach adopting the ViT network model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a training method of an image recognition model according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for obtaining the first preset loss function value corresponding to each preset ViT network according to an embodiment of the present application;
FIG. 3 is a flowchart of another training method of an image recognition model according to an embodiment of the present application;
FIG. 4 is a flowchart of yet another training method of an image recognition model according to an embodiment of the present application;
FIG. 5 is a block diagram of a training network for an image recognition model according to an embodiment of the present application;
FIG. 6 is a block diagram of a training device for an image recognition model according to an embodiment of the present application.
Detailed Description
The embodiments of the present application are described in detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application. The embodiments described below and their technical features may be combined with each other provided there is no conflict.
The ViT network model (Vision Transformer) essentially divides an input image into a number of patches (small blocks) and marks each block with position information (i.e. its sequence order). After linear projection, the resulting tokens are fed to a transformer encoder, whose essence is a multi-head self-attention mechanism: it finds the relevance of each image patch to every other patch, focuses attention (i.e. allocates weight) on the patches most closely related to the whole image, and then integrates this information to obtain global features and output a prediction label.
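For illustration only, the following is a minimal PyTorch sketch of the patch embedding and encoding just described; the module layout, dimensions and hyper-parameters here are assumptions chosen for readability, not part of the patent:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier: split into patches, linearly project,
    add position information, encode with multi-head self-attention."""
    def __init__(self, in_ch=3, img_size=224, patch=16, dim=192,
                 heads=3, depth=4, classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch split + linear projection in one strided convolution
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned position embedding marks each patch's sequence order
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed
        z = self.encoder(z)         # multi-head self-attention over patches
        return self.head(z[:, 0])   # prediction label from the class token
```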
As shown in FIG. 1, a training method of an image recognition model is provided, the training method comprising:
Step S110, feature extraction is performed on the input training image data set through the convolutional neural network to obtain the predicted label value corresponding to the convolutional neural network.
When the convolutional neural network processes the training image data set, it convolves the input continuously, thereby performing feature extraction and producing the predicted label value corresponding to the convolutional neural network.
Step S120, the feature maps output by a plurality of intermediate layers in the convolutional neural network are acquired.
When the convolutional neural network processes the training image data set, the input is convolved layer by layer, so a number of feature maps are produced along the way; the feature maps output by a plurality of intermediate layers in the convolutional neural network can therefore be acquired.
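One way to obtain these intermediate feature maps in practice is with forward hooks; the sketch below uses torchvision's VGG16 (the backbone named in a later embodiment), and the tapped layer indices are illustrative assumptions:

```python
import torch
import torchvision

cnn = torchvision.models.vgg16(num_classes=10)  # random weights, for illustration

feature_maps = {}
def make_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output   # stash this layer's feature map
    return hook

# Indices 4, 9 and 16 are max-pooling layers in torchvision's VGG16
for idx in [4, 9, 16]:
    cnn.features[idx].register_forward_hook(make_hook(f"layer{idx}"))

x = torch.randn(2, 3, 224, 224)
cnn_logits = cnn(x)   # the CNN's own predicted label values
print({k: tuple(v.shape) for k, v in feature_maps.items()})
# {'layer4': (2, 64, 112, 112), 'layer9': (2, 128, 56, 56), 'layer16': (2, 256, 28, 28)}
```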
Step S130, the feature maps output by each intermediate layer are respectively input into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network.
The input ends of the preset ViT networks are connected one-to-one with the intermediate layers of the convolutional neural network, so that feature extraction is performed separately on the feature maps output by each intermediate layer, and the predicted label value and the first preset loss function value corresponding to each preset ViT network are obtained.
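Continuing the sketch above, each preset ViT branch must be built to accept the channel count and spatial size of the feature map at its tap point rather than a raw RGB image; the concrete numbers below follow torchvision's VGG16 and are otherwise assumptions:

```python
# One ViT branch per tapped intermediate layer (MiniViT defined earlier)
vit_branches = {
    "layer4":  MiniViT(in_ch=64,  img_size=112, patch=8, classes=10),
    "layer9":  MiniViT(in_ch=128, img_size=56,  patch=4, classes=10),
    "layer16": MiniViT(in_ch=256, img_size=28,  patch=2, classes=10),
}

# detach(): in this sketch the branch losses train only the ViTs, while
# the CNN is trained later by the second preset loss function
vit_logits = {name: vit(feature_maps[name].detach())
              for name, vit in vit_branches.items()}
```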
Step S140, the weights and biases of each preset ViT network are respectively updated according to the corresponding first preset loss function value.
During training, when the feature maps output by the intermediate layers of the convolutional neural network are input into the preset ViT networks for feature extraction, the weights and biases of each preset ViT network are updated at the same time; that is, each preset ViT network is optimized independently during training, so that when the feature maps output by the intermediate layers are processed the next time, the first preset loss function value keeps decreasing.
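A sketch of this independent optimization, with one optimizer per preset ViT network; taking cross entropy as the first preset loss function is an assumption, since the patent only calls it a first preset loss function:

```python
import torch.nn.functional as F

vit_opts = {name: torch.optim.Adam(vit.parameters(), lr=1e-4)
            for name, vit in vit_branches.items()}

def update_vit_branches(targets):
    """One weight-and-bias update per preset ViT from its own first loss."""
    first_losses = {}
    for name, vit in vit_branches.items():
        logits = vit(feature_maps[name].detach())
        loss = F.cross_entropy(logits, targets)  # first preset loss function value
        vit_opts[name].zero_grad()
        loss.backward()        # touches this branch's weights and biases only
        vit_opts[name].step()
        first_losses[name] = loss.item()
    return first_losses
```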
Step S150, the integrated predicted label value is calculated according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks.
To combine each preset ViT network model more closely with the convolutional neural network, the convolutional neural network should be able to adjust its convolution kernel parameters during training according to the prediction results of the preset ViT network models, so that each convolutional layer extracts the feature maps most relevant to the prediction result and the accuracy of the output label is enhanced. To this end, the predicted label values corresponding to the preset ViT networks and the predicted label value corresponding to the convolutional neural network are integrated, and the integrated predicted label value is then calculated.
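The integration itself can be as simple as a weighted combination of the logits; a later embodiment only states that the values are weighted, so the uniform ViT average and the 0.5 split below are assumptions:

```python
def integrate_predictions(cnn_logits, vit_logits, cnn_weight=0.5):
    """Weighted combination of the CNN prediction and the ViT predictions."""
    vit_mean = torch.stack(list(vit_logits.values())).mean(dim=0)
    # The ViT term is already cut off from the CNN graph, so a loss computed
    # on this value back-propagates through cnn_logits only
    return cnn_weight * cnn_logits + (1.0 - cnn_weight) * vit_mean
```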
Step S160, the second preset loss function value corresponding to the convolutional neural network is calculated according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value.
After the integrated predicted label value is obtained, the second preset loss function value is calculated from the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network, and the real label value.
Because the second preset loss function value is calculated from the integrated predicted label value, the convolutional neural network can draw on the prediction strengths of each preset ViT network model, which lays a foundation for subsequently optimizing the convolutional neural network with those strengths.
Step S170, the weights and biases of the convolutional neural network are updated according to the second preset loss function value.
After the second preset loss function value is calculated, the gradient is obtained by back propagation and the weights and biases of the convolutional neural network are updated accordingly.
Step S180, the above steps are executed cyclically until the second preset loss function converges, and the corresponding image recognition model is generated.
Steps S110 to S170 are executed in a loop, so that the convolutional neural network keeps updating its weights and biases until the second preset loss function corresponding to the convolutional neural network converges, at which point the corresponding image recognition model is generated.
In this embodiment, the training image data set may be divided into a number of data subsets. The convolutional neural network is first trained on one subset with a preset learning rate, and its weight and bias parameters are updated; since the parameters have changed, the feature maps output by each intermediate layer change as well. The updated convolutional neural network then infers that subset again, and steps S120 to S180 are performed to update the weights and biases of each preset ViT network and of the convolutional neural network. This process is repeated over the data subsets until the second preset loss function of the convolutional neural network converges, and the corresponding image recognition model is finally generated.
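Putting steps S110 to S180 together, a condensed sketch of one possible training loop over the data subsets, continuing the code above; the optimizer settings and the loss-delta convergence test are assumptions, as the patent does not fix them:

```python
cnn_opt = torch.optim.SGD(cnn.parameters(), lr=1e-3, momentum=0.9)

def train(loader, epochs=10, tol=1e-3):
    prev = float("inf")
    for _ in range(epochs):
        running = 0.0
        for images, targets in loader:        # each batch acts as a data subset
            cnn_logits = cnn(images)          # S110/S120: hooks refill feature_maps
            update_vit_branches(targets)      # S130/S140: per-ViT losses and updates
            vit_logits = {n: v(feature_maps[n].detach())
                          for n, v in vit_branches.items()}
            integrated = integrate_predictions(cnn_logits, vit_logits)  # S150
            second_loss = F.cross_entropy(integrated, targets)          # S160
            cnn_opt.zero_grad()
            second_loss.backward()            # S170: update CNN weights and biases
            cnn_opt.step()
            running += second_loss.item()
        if abs(prev - running) < tol:         # S180: crude convergence test
            return cnn
        prev = running
    return cnn   # the deployed model is the CNN alone, without the ViT branches
```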
The generated image recognition model does not need to include the preset ViT networks.
According to the above training method of the image recognition model, the feature maps output by the convolutional neural network are input into each preset ViT network, so that the weights of each preset ViT network are optimized directly at the feature-map level. Meanwhile, the integrated predicted label value is calculated according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks; the second preset loss function value corresponding to the convolutional neural network is calculated according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value; and the weights and biases of the convolutional neural network are updated according to the second preset loss function value. Finally, the corresponding image recognition model is generated when the second preset loss function converges. The image recognition model is thus fully trained with the help of the preset ViT networks while producing recognition predictions without depending on them, adapts better to the various inference platforms, still makes use of the many excellent convolutional network structures and training methods of the existing computer vision field, and effectively fuses the conventional convolutional neural network structure with the ViT network model.
In one embodiment, the intermediate layers are pooling layers. As shown in FIG. 2, step S130 includes:
Step S132, the feature maps output by each pooling layer are respectively input into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value corresponding to each preset ViT network.
Step S134, the first preset loss function value corresponding to each preset ViT network is calculated according to the first preset loss function, the predicted label value corresponding to each preset ViT network and the real label value.
In a convolutional neural network, the pooling layers compress the feature maps and together cover feature maps of every scale; choosing the pooling layers as the intermediate layers therefore makes full use of feature maps at all scales for training the preset ViT networks. On this basis, the feature maps output by each pooling layer are input into the preset ViT networks for feature extraction to obtain the corresponding predicted label values, and the first preset loss function values corresponding to each preset ViT network are then obtained from the first preset loss function, the predicted label values and the corresponding real label values, laying a foundation for the subsequent training process.
In one embodiment, the pooling layer is a maximum pooling layer, and the convolutional neural network is a VGG16 backbone network.
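Under this embodiment the tap points can be located mechanically; continuing the earlier sketch (torchvision's VGG16 layer layout is assumed):

```python
pool_idx = [i for i, m in enumerate(cnn.features)
            if isinstance(m, torch.nn.MaxPool2d)]
print(pool_idx)   # [4, 9, 16, 23, 30] in torchvision's VGG16
```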
In one embodiment, the second preset loss function employs a cross entropy loss function. As shown in FIG. 3, before step S110 the method further includes:
Step S190, based on the cross entropy loss function, the training image data set is input into the initial convolutional neural network for training until the cross entropy loss function converges, and the convolutional neural network after training convergence is obtained.
In other words, before step S110 is performed, the convolutional neural network itself has typically already been trained separately until convergence.
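A minimal sketch of that separate pre-training stage, again assuming a simple loss-delta test as the convergence criterion:

```python
def pretrain_cnn(loader, epochs=50, tol=1e-4):
    """Train the CNN alone with cross entropy before the joint stage."""
    opt = torch.optim.SGD(cnn.parameters(), lr=1e-2, momentum=0.9)
    prev = float("inf")
    for _ in range(epochs):
        total = 0.0
        for images, targets in loader:
            loss = F.cross_entropy(cnn(images), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if abs(prev - total) < tol:   # treat a flat epoch loss as convergence
            return
        prev = total
```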
In one embodiment, as shown in FIG. 4, before step S130 the method further includes:
Step S200, based on the first preset loss function, the training image data set is input to each initial ViT network to train until the corresponding first preset loss function converges, and each preset ViT network after training convergence is obtained.
In one embodiment, step S150 includes: and weighting the predicted tag value corresponding to the convolutional neural network and the predicted tag value corresponding to each preset ViT network to calculate and obtain an integrated predicted tag value.
By calculating the integrated predicted label value, the predicted label values corresponding to the ViT networks can influence the back propagation of the convolutional neural network as part of the loss, realizing an effective fusion of the conventional convolutional neural network structure and the ViT network model.
In addition, as shown in FIG. 5, a training network 210 of an image recognition model is also provided, including:
a convolutional neural network processing unit 220, configured to perform feature extraction on an input training image data set through a convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
a ViT network processing unit 230, connected to the output ends of a plurality of intermediate layers in the convolutional neural network processing unit, and configured to respectively input the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network.
The ViT network processing unit 230 is further configured to respectively update the weights and biases of each preset ViT network according to the corresponding first preset loss function value.
The convolutional neural network processing unit 220 is further configured to calculate the integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks, calculate the second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value, and update the weights and biases of the convolutional neural network according to the second preset loss function value until the second preset loss function converges to generate the corresponding image recognition model.
In addition, as shown in FIG. 6, a training device 300 for an image recognition model is also provided, including:
a label value generating unit 310, configured to perform feature extraction on an input training image data set through a convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
a feature map acquisition unit 320, configured to acquire the feature maps output by a plurality of intermediate layers in the convolutional neural network;
a ViT network feature extraction unit 330, configured to respectively input the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network;
a first updating unit 340, configured to respectively update the weights and biases of each preset ViT network according to the corresponding first preset loss function value;
a label value integration unit 350, configured to calculate the integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
a loss function value generating unit 360, configured to calculate the second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, the second preset loss function corresponding to the convolutional neural network and the real label value;
a second updating unit 370, configured to update the weights and biases of the convolutional neural network according to the second preset loss function value;
and a model generating unit 380, configured to generate the corresponding image recognition model when the second preset loss function converges.
In addition, an image recognition method is provided, which performs image recognition using the image recognition model obtained by the above training method.
Furthermore, a device terminal is provided, which comprises a processor and a memory, the memory being used to store a computer program, and the processor running the computer program to cause the device terminal to perform the training method described above.
Further, a readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the training method described above.
The division of units in the above training device is for illustration only; in other embodiments, the training device may be divided into different units as required to complete all or part of its functions. For specific limitations on the training device, see the limitations on the method above, which are not repeated here.
The foregoing embodiments merely express implementations of the present application and are not intended to limit its patent scope; all equivalent structures or equivalent processes derived from the description and drawings of the present application, such as combinations of technical features between embodiments, or direct or indirect applications in other related technical fields, are likewise included in the patent protection scope of the present application.
In addition, structural elements having the same or similar characteristics may be identified in the present application by the same or different reference numerals. Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
In the present application, the word "e.g." is used to mean "serving as an example, instance, or illustration". Any embodiment described as "for example" in this disclosure is not necessarily to be construed as preferred or advantageous over other embodiments. The previous description is provided to enable any person skilled in the art to make or use the present application. In the above description, various details are set forth for purposes of explanation.
It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes have not been shown in detail to avoid unnecessarily obscuring the description of the application. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Claims (10)

1. A method for training an image recognition model, comprising:
performing feature extraction on an input training image data set through a convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
acquiring feature maps output by a plurality of intermediate layers in the convolutional neural network;
respectively inputting the feature maps output by each intermediate layer into corresponding preset ViT networks to perform feature extraction, so as to obtain a predicted label value and a first preset loss function value corresponding to each preset ViT network;
respectively updating weights and biases of each preset ViT network according to the corresponding first preset loss function value;
calculating an integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
calculating a second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, a second preset loss function corresponding to the convolutional neural network and a real label value;
updating weights and biases of the convolutional neural network according to the second preset loss function value;
and circularly executing the steps until the second preset loss function converges to generate a corresponding image recognition model.
2. The training method according to claim 1, wherein the intermediate layers are pooling layers, and the step of respectively inputting the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network, comprises:
respectively inputting the feature maps output by each pooling layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value corresponding to each preset ViT network;
and calculating the first preset loss function value corresponding to each preset ViT network according to the first preset loss function, the predicted label value corresponding to each preset ViT network and the real label value.
3. The training method according to claim 1, wherein the second preset loss function employs a cross entropy loss function, and before the step of performing feature extraction on the input training image data set through the convolutional neural network to obtain the corresponding predicted label value, the training method further comprises:
inputting the training image data set into an initial convolutional neural network for training based on the cross entropy loss function until the cross entropy loss function converges, to obtain the convolutional neural network after training convergence.
4. The training method according to claim 1, wherein before the step of respectively inputting the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain the predicted label value and the first preset loss function value corresponding to each preset ViT network, the training method further comprises:
inputting the training image data set into each initial ViT network for training based on the first preset loss function until the corresponding first preset loss function converges, to obtain each preset ViT network after training convergence.
5. The training method according to claim 1, wherein the step of calculating the integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks comprises:
weighting the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks to obtain the integrated predicted label value.
6. A training network for an image recognition model, comprising:
a convolutional neural network processing unit, configured to perform feature extraction on an input training image data set through a convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
a ViT network processing unit, connected to the output ends of a plurality of intermediate layers in the convolutional neural network processing unit, and configured to respectively input the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain a predicted label value and a first preset loss function value corresponding to each preset ViT network;
wherein the ViT network processing unit is further configured to respectively update weights and biases of each preset ViT network according to the corresponding first preset loss function value;
and the convolutional neural network processing unit is further configured to calculate an integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks, calculate a second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, a second preset loss function corresponding to the convolutional neural network and a real label value, and update weights and biases of the convolutional neural network according to the second preset loss function value until the second preset loss function converges to generate a corresponding image recognition model.
7. A training device for an image recognition model, comprising:
a label value generating unit, configured to perform feature extraction on an input training image data set through a convolutional neural network to obtain a predicted label value corresponding to the convolutional neural network;
a feature map acquisition unit, configured to acquire feature maps output by a plurality of intermediate layers in the convolutional neural network;
a ViT network feature extraction unit, configured to respectively input the feature maps output by each intermediate layer into the corresponding preset ViT networks to perform feature extraction, so as to obtain a predicted label value and a first preset loss function value corresponding to each preset ViT network;
a first updating unit, configured to respectively update weights and biases of each preset ViT network according to the corresponding first preset loss function value;
a label value integration unit, configured to calculate an integrated predicted label value according to the predicted label value corresponding to the convolutional neural network and the predicted label values corresponding to the preset ViT networks;
a loss function value generating unit, configured to calculate a second preset loss function value corresponding to the convolutional neural network according to the integrated predicted label value, a second preset loss function corresponding to the convolutional neural network and a real label value;
a second updating unit, configured to update weights and biases of the convolutional neural network according to the second preset loss function value;
and a model generating unit, configured to generate a corresponding image recognition model when the second preset loss function converges.
8. An image recognition method, characterized in that the image recognition is performed using the image recognition model trained by the training method according to any one of claims 1 to 5.
9. A device terminal, characterized in that it comprises a processor and a memory for storing a computer program, the processor running the computer program to cause the device terminal to perform the training method of any of claims 1 to 5.
10. A readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the training method of any one of claims 1 to 5.
CN202210110008.8A 2022-01-29 2022-01-29 Training method and device of image recognition model, network and image recognition method Active CN114463559B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210110008.8A (CN114463559B) | 2022-01-29 | 2022-01-29 | Training method and device of image recognition model, network and image recognition method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210110008.8A (CN114463559B) | 2022-01-29 | 2022-01-29 | Training method and device of image recognition model, network and image recognition method

Publications (2)

Publication Number | Publication Date
CN114463559A (en) | 2022-05-10
CN114463559B (en) | 2024-05-10

Family

ID=81410757

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210110008.8A (CN114463559B, Active) | Training method and device of image recognition model, network and image recognition method | 2022-01-29 | 2022-01-29

Country Status (1)

Country Link
CN (1) CN114463559B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796161A (en) * 2019-09-18 2020-02-14 Ping An Technology (Shenzhen) Co., Ltd. Recognition model training method, recognition method, device, equipment and medium for fundus features
CN110889428A (en) * 2019-10-21 2020-03-17 Zhejiang Dasouche Software Technology Co., Ltd. Image recognition method and device, computer equipment and storage medium
WO2021012526A1 (en) * 2019-07-22 2021-01-28 Ping An Technology (Shenzhen) Co., Ltd. Face recognition model training method, face recognition method and apparatus, device, and storage medium
WO2021102655A1 (en) * 2019-11-25 2021-06-03 Shenzhen Heytap Technology Co., Ltd. Network model training method, image property recognition method and apparatus, and electronic device
CN113239981A (en) * 2021-04-23 2021-08-10 University of Chinese Academy of Sciences Image classification method of local feature coupling global representation
CN113887610A (en) * 2021-09-29 2022-01-04 Inner Mongolia University of Technology Pollen image classification method based on cross-attention distillation Transformer


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Small-sample object image recognition based on convolutional network feature transfer; Bai Jie; Zhang Jinsong; Liu Qianyu; Computer Simulation; 2020-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN114463559A (en) 2022-05-10

Similar Documents

Publication Publication Date Title
CN109840322B (en) Complete shape filling type reading understanding analysis model and method based on reinforcement learning
CN113343705B (en) Text semantic based detail preservation image generation method and system
CN110134774A (en) It is a kind of based on the image vision Question-Answering Model of attention decision, method and system
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN112991502A (en) Model training method, device, equipment and storage medium
CN114792349B (en) Remote sensing image conversion map migration method based on semi-supervised generation countermeasure network
CN109902192A (en) Remote sensing image retrieval method, system, equipment and the medium returned based on unsupervised depth
CN115017178A (en) Training method and device for data-to-text generation model
CN113822790B (en) Image processing method, device, equipment and computer readable storage medium
CN114445641A (en) Training method, training device and training network of image recognition model
CN116823782A (en) Reference-free image quality evaluation method based on graph convolution and multi-scale features
CN114463559B (en) Training method and device of image recognition model, network and image recognition method
CN114861917A (en) Knowledge graph inference model, system and inference method for Bayesian small sample learning
CN118135062B (en) Image editing method, device, equipment and storage medium
CN118096922A (en) Method for generating map based on style migration and remote sensing image
CN110866866B (en) Image color imitation processing method and device, electronic equipment and storage medium
CN116797681A (en) Text-to-image generation method and system for progressive multi-granularity semantic information fusion
CN117094963A (en) Fundus image focus segmentation method, system, equipment and storage medium
CN115100435B (en) Image coloring method and system based on finite data multi-scale target learning
CN117593639A (en) Extraction method, device, equipment and medium for highway and its accessories
CN116844008A (en) Attention mechanism guided content perception non-reference image quality evaluation method
CN115457269A (en) Semantic segmentation method based on improved DenseNAS
CN116383439A (en) Method and device for retrieving video by using text
CN111461228B (en) Image recommendation method and device and storage medium
JP7338858B2 (en) Behavior learning device, behavior learning method, behavior determination device, and behavior determination method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240410

Address after: Building 9, Building 203B, Phase II, Nanshan Yungu Entrepreneurship Park, No. 2 Pingshan 1st Road, Pingshan Community, Taoyuan Street, Nanshan District, Shenzhen, Guangdong Province, 518033

Applicant after: Core Computing Integrated (Shenzhen) Technology Co.,Ltd.

Country or region after: China

Address before: 830000 room 801, 8 / F, building E2, Xinjiang Software Park, 455 Kanas Hubei Road, economic and Technological Development Zone (Toutunhe District), Urumqi, Xinjiang Uygur Autonomous Region

Applicant before: XINJIANG AIHUA YINGTONG INFORMATION TECHNOLOGY Co.,Ltd.

Country or region before: China

GR01 Patent grant