CN116740505A - Training of an image classification model, image classification method, apparatus, machine-readable medium and device - Google Patents
Training of an image classification model, image classification method, apparatus, machine-readable medium and device
- Publication number
- CN116740505A (Application No. CN202310818502.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- training
- image classification
- classification model
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a training method for an image classification model, which includes the following steps: acquiring a training set composed of a plurality of training samples, wherein each training sample includes image data and Chinese descriptive data corresponding to the image data, and the Chinese descriptive data includes a learnable semantic vector carrying an image category representation; and training an initial model with the training set based on few-shot learning to obtain the image classification model. According to the invention, a learnable semantic vector is added to the Chinese descriptive text samples in the training set, and the performance of the image classification model is improved by learning this vector, so that images classified with the model are classified more accurately.
Description
Technical Field
The present invention relates to the field of image processing, and in particular to a training method, apparatus, machine-readable medium and device for an image classification model.
Background
In recent years, with the advent of powerful computing devices (e.g., GPUs and distributed platforms), large data sets (e.g., the ImageNet data set), and advanced models and algorithms (e.g., convolutional neural networks, CNNs, and recurrent neural networks, RNNs), AI has narrowed the gap with humans and has even surpassed humans in many areas. For example, AlphaGo defeated human players at Go. These successes rely to a large extent on learning from large-scale data. However, collecting a large number of samples consumes a significant amount of time and money, and such samples may even be difficult to obtain due to ethical, privacy or security issues. Few-shot learning (small sample learning) has therefore been proposed to address learning from a small number of supervised samples. In existing few-shot learning, a fixed, static prompt-learning template is used, and this approach performs poorly across different data sets.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a training method, apparatus, machine-readable medium and device for an image classification model, so as to solve the problems existing in the prior art.
To achieve the above and other related objects, the present invention provides a training method of an image classification model, the training method comprising:
acquiring a training set formed by a plurality of training samples, wherein each training sample comprises image data and Chinese descriptive data corresponding to the image data; the Chinese descriptive data includes a learnable semantic vector having an image category representation;
training an initial model with the training set based on few-shot learning to obtain an image classification model.
In an embodiment of the present invention, the training the initial model using the training set includes:
extracting features of the training sample through a feature extraction layer in the initial model to obtain first text features and image features;
applying, by a context-aware layer in the initial model, a multi-head attention mechanism so that the image feature attends to the first text feature, to obtain a second text feature;
calculating the similarity between the second text feature and the image feature through a similarity measurement layer in the initial model;
and constructing a loss function based on the similarity, and performing iterative training on the initial model according to the loss function to obtain an image classification model.
In an embodiment of the present invention, the feature extraction of the training sample by the feature extraction layer includes:
extracting features of the image data in the training sample through a visual encoder in a feature extraction layer to obtain image features;
extracting the characteristics of Chinese descriptive data in the training sample through a text encoder of the characteristic extraction layer to obtain a first text characteristic;
the text encoder and the visual encoder are obtained by performing contrast learning training on image features and first text features by using a training set formed by an image sample and Chinese descriptive text samples corresponding to the image sample.
In an embodiment of the present invention, using a multi-head attention mechanism so that the image feature attends to the first text feature to obtain the second text feature includes:
carrying out global pooling treatment on the image features to obtain global features;
performing feature fusion on the image features and the global features to obtain first fusion features;
inputting the first fusion characteristic into a multi-head attention network to obtain a second fusion characteristic;
and carrying out feature fusion on the image features and the second fusion features to obtain second text features.
In one embodiment of the invention, the image class representation is located at a beginning position, a middle position, or an ending position of the learnable semantic vector.
In an embodiment of the present invention, the loss function includes a cross entropy loss function and a divergence loss function, wherein the cross entropy loss function is used for constraining a first prediction classification value and a true classification value, and the divergence loss function is used for constraining a first prediction classification value and a second prediction classification value, and the second prediction classification value is a zero sample prediction classification value.
In an embodiment of the present invention, in the process of training the image classification model, parameters of the visual encoder and the text encoder are fixed, and the learnable semantic vector is updated.
To achieve the above and other related objects, the present invention provides a training apparatus for an image classification model, the training apparatus comprising:
the data acquisition module is used for acquiring a training set formed by a plurality of training samples, wherein each training sample comprises image data and Chinese descriptive data corresponding to the image data; the Chinese descriptive data includes a learnable semantic vector having an image category representation;
And the training module is used for training the initial model with the training set based on few-shot learning to obtain an image classification model.
To achieve the above and other related objects, the present invention provides an image classification method comprising:
acquiring an image to be classified;
and inputting the image to be classified into the image classification model, and taking the output of the image classification model as the class of the image to be classified.
To achieve the above and other related objects, the present invention provides an image classification apparatus comprising:
the image acquisition module is used for acquiring images to be classified;
and the image classification module is used for inputting the images to be classified into the image classification model, and taking the output of the image classification model as the category of the images to be classified.
To achieve the above and other related objects, the present invention also provides an electronic device, including:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the training method of the image classification model or the image classification method described above.
To achieve the above and other related objects, the present invention also provides one or more machine-readable media having instructions stored thereon which, when executed by one or more processors, cause a device to perform the training method of the image classification model or the image classification method described above.
As described above, the training method, device, machine-readable medium and equipment for image classification model provided by the invention have the following beneficial effects:
the invention discloses a training method of an image classification model, which comprises the following steps: acquiring a training set formed by a plurality of training samples, wherein each training sample comprises image data and Chinese descriptive data corresponding to the image data; the Chinese descriptive data includes a learnable semantic vector having an image category representation; based on the small sample learning, training the initial model by using the training set to obtain an image classification model. According to the invention, the image classification model is characterized in that the learner-driven semantic vector is added into the Chinese descriptive text sample in the training set, and the performance of the image classification model is improved through the learning of the learner-driven semantic vector, so that the image classification model is used for classifying images, and the classification is more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an implementation environment of a training method of an image classification model according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a training method of an image classification model according to an exemplary embodiment of the application;
FIG. 3 is a flow chart illustrating training of an initial model using a training set in accordance with an exemplary embodiment of the present application;
FIG. 4 is a flow chart illustrating the use of a multi-headed attention mechanism to cause an image feature to add attention to a first text feature in accordance with an exemplary embodiment of the present application;
FIG. 5 is a block diagram of a training apparatus for an image classification model according to an exemplary embodiment of the present application;
FIG. 6 is a flow chart of an image classification method according to an exemplary embodiment of the invention;
FIG. 7 is a block diagram of an image classification apparatus according to an exemplary embodiment of the present invention;
fig. 8 is a schematic hardware structure of a terminal device according to an embodiment of the present invention;
fig. 9 is a schematic hardware structure of a terminal device according to an embodiment of the invention.
Detailed Description
Further advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, with reference to the accompanying drawings and the preferred embodiments. The invention may also be practiced or carried out in other, different embodiments, and the details in this description may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In the following description, numerous details are set forth in order to provide a more thorough explanation of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the embodiments of the present invention may be practiced without these specific details. In other embodiments, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the embodiments of the present invention.
In recent years, with the advent of powerful computing devices (e.g., GPUs and distributed platforms), large data sets (e.g., the ImageNet data set), and advanced models and algorithms (e.g., convolutional neural networks, CNNs, and recurrent neural networks, RNNs), AI has narrowed the gap with humans and has even surpassed humans in many areas. For example, AlphaGo defeated human players at Go. These successes rely to a large extent on learning from large-scale data. In contrast, humans can learn quickly from a small number of samples. For example, a child shown a few images of an animal he has never seen can, after learning, quickly pick out similar images of that newly learned animal from a collection of samples. On the other hand, collecting a large number of samples sometimes requires a great deal of time and money, and a large number of samples may even be difficult to obtain due to ethical, privacy or security problems. Therefore, few-shot learning has been proposed to address learning from a small number of supervised samples, for example the identification of certain special vehicles or of popular new clothing.
A common practice for constructing a visual recognition system is to train a visual model and then predict over a fixed set of discrete labels, e.g., 10 classes or 100 classes. This way of learning limits the number of categories and turns the task into a "closed set" classification problem, which requires retraining with additional data whenever new categories appear. Formulating recognition as an open-set problem therefore offers much better extensibility. In recent years, the appearance of vision-language pre-training models such as CLIP and ALIGN has brought new ideas to visual representation learning. Their main idea is to build a separate encoder for each of the two modalities (text and image) and then align the two modalities using a contrastive-learning training scheme. In the inference phase, because of the lack of text information, the problem of text input is solved by means of prompt learning.
In conventional visual classification, the number of categories of the network is predefined, e.g., 10 categories or 100 categories, and is mainly determined by the fully connected layer of the network. Only when the number of network categories has been fixed in advance can the output of the network and the actual labels be optimized in discrete form. With the rise of vision-text pre-training models such as CLIP, contrastive-learning-based training can be used to align the text and image spaces, so that a fixed number of classes no longer needs to be set, which gives good extensibility. Within deep learning, a large amount of training data is usually required for a model to work effectively, and the labeling and collection of many data samples is often expensive in time and money; the starting point of few-shot learning is therefore to optimize the model with a small number of samples, which has great practical value.
The earliest CLIP work simply used the image category information and then applied a prompt learning method to turn the category into a sentence, such as "a photo of {class_name}". When zero-shot inference (i.e., no samples are involved in training) is not effective enough, the model needs to be optimized for a particular scenario with a small number of samples (e.g., 1 sample, 2 samples, etc.). Under different data sets, using a rigid, static prompt-learning template is not necessarily the best choice.
Meanwhile, the current application scenarios of vision-text pre-training models are mainly in English; since previous models were basically trained on English, content has to be translated into English before such a model can be applied directly. Because Chinese contains certain special words and contexts that are often very difficult to express in English, and translation itself requires professional domain knowledge and time, a vision-text pre-training model adapted to Chinese is still necessary.
In view of the foregoing, embodiments of the present application provide a training method of an image classification model, a training apparatus of an image classification model, an image classification method, an image classification apparatus, an electronic device, and a computer-readable storage medium, respectively.
First, terms related to one or more embodiments of the present specification will be explained.
CLIP: contrastive Language-Image Pre-training is performed with high efficiency by utilizing the ideas of contrast learning training on the premise of big data and big model, and a good effect is obtained on a plurality of data sets, namely on zero shot and few shot tasks;
ALIGN: a Large-scale ImaGe and Noisy-text casting uses Large-scale noisy image-text data to augment visual and visual-language representation learning. The authors avoid the workload of preprocessing and labeling the data, requiring only simple filtering based on the frequency of the data. On this dataset, the authors train a very simple dual encoder model ALIGN based on contrast learning to train the loss function;
ResNet: deep Residual Learning for Image Recognition this series of networks is widely used in the field of object classification etc. as part of the classical neural network of the backbone of computer vision tasks, typical networks being resnet50, resnet101 etc. The Resnet network demonstrates that the network can evolve towards deeper (containing more hidden layers).
Transformer: attention IsAllYouNeed the transfomer is characterized by discarding the traditional CNN and RNN, and the whole network structure is completely composed of self-Attention mechanism. Due to its excellent performance and friendliness to downstream tasks, it is widely used in NLP fields such as machine translation, question-answering systems, text summarization and speech recognition, etc.
BERT (Bidirectional Encoder Representations from Transformer, bi-directional semantic coding) is an optimized neural network model for a transducer, and natural language text is extracted and analyzed through an attention mechanism.
FIG. 1 is a schematic diagram of an implementation environment of the training method of an image classification model according to an exemplary embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal device 110 and a server 120, where the terminal device 110 and the server 120 communicate through a wired or wireless network. A plurality of images may be provided on the terminal device, and a training set is then constructed from these images, wherein each training sample includes image data and Chinese descriptive data corresponding to the image data; the Chinese descriptive data includes a learnable semantic vector carrying an image category representation. An initial model may be set up on the terminal device and/or the server and, based on few-shot learning, trained with the training set to obtain an image classification model. According to the application, a learnable semantic vector is added to the Chinese descriptive text samples in the training set, and the performance of the image classification model is improved by learning this vector, so that images classified with the model are classified more accurately.
It should be understood that the number of terminal devices 110 and servers 120 in fig. 1 is merely illustrative. There may be any number of terminal devices 110 and servers 120 as practical.
The terminal device 110 corresponds to a client and may be any electronic device having a user input interface, the user input interface including but not limited to a touch screen, a keyboard, physical keys, an audio pick-up device, and the like, and the device including but not limited to a smart phone, a tablet, a notebook computer, a desktop computer, a car-mounted computer, and the like.
The server 120 corresponds to a server side, which may be a server providing various services; it may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network) services, big data and artificial intelligence platforms, which is not limited herein.
The terminal device 110 may communicate with the server 120 through a wireless network such as 3G (third generation mobile information technology), 4G (fourth generation mobile information technology), 5G (fifth generation mobile information technology), and the like, which is not limited herein.
Referring to fig. 2, fig. 2 is a flowchart illustrating a training method of an image classification model according to an exemplary embodiment of the present application. The training method of the image classification model may be applied to the implementation environment shown in fig. 1 and specifically executed by the server 120 in the implementation environment. It should be understood that the training method of the image classification model may be applied to other exemplary implementation environments, and be specifically executed by devices in other implementation environments, and the embodiment does not limit the implementation environments to which the training method of the image classification model is applied.
Referring to fig. 2, fig. 2 is a flowchart illustrating an exemplary method for training an image classification model according to the present application, where the method for training an image classification model at least includes steps S210 to S220, and the detailed description is as follows:
step S210, a training set formed by a plurality of training samples is obtained, and each training sample comprises image data and Chinese descriptive data corresponding to the image data; the Chinese descriptive data includes a learnable semantic vector having an image category representation;
step S220, training the initial model with the training set based on few-shot learning to obtain an image classification model.
According to the invention, a learnable semantic vector is added to the Chinese descriptive text samples in the training set, and the performance of the image classification model is improved by learning this vector, so that images classified with the model are classified more accurately.
In the present invention, let f be the image feature obtained by passing the input image x through the image encoder, and let w_i be the weight vector obtained from the text encoder, with K the number of image categories. Each w_i is derived from the prompt template "a photo of a [CLASS]" (a prompt template here means a template that turns a word into a sentence). Turning a word into a sentence has two main motivations: 1. a Chinese word has different meanings in different contexts, and its meaning is precise only within a particular sentence; 2. since training is performed at the sentence level, the input must also be turned into a sentence when tested. At implementation time, [CLASS] is replaced by concrete names such as "dog", "cat", "car", etc. The final prediction probability can be described by the following formula:
p(y = i | x) = exp(cos(w_i, f) / τ) / Σ_{j=1}^{K} exp(cos(w_j, f) / τ)
where τ is the temperature coefficient controlling the shape of the distribution, and cos(·, ·) denotes cosine similarity.
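As a concrete illustration of the formula above, the following is a minimal sketch (not the patent's code; the tensor shapes and the default temperature value are assumptions) of computing the zero-shot class probabilities from already-extracted features:

```python
import torch

def zero_shot_probs(f: torch.Tensor, w: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """f: (D,) image feature; w: (K, D) per-class text features; returns (K,) probabilities."""
    f = f / f.norm()                       # normalize so a dot product equals cosine similarity
    w = w / w.norm(dim=-1, keepdim=True)
    logits = (w @ f) / tau                 # cos(w_i, f) / tau for every class i
    return logits.softmax(dim=-1)          # p(y = i | x)
```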
To replace the rigid, static prompt template, a learnable semantic vector is introduced here, which consists of the following sequence of vectors:
t = [V]_1 [V]_2 ... [V]_M [CLASS]
where each [V]_m (m ∈ {1, 2, ..., M}) is a vector with the same dimension as a word embedding, and M is a hyper-parameter controlling the length of the semantic vector. The prompt t is then processed by the text encoder g(·). The prediction probability after introducing the learnable semantic vector can be described by the following formula:
p(y = i | x) = exp(cos(g(t_i), f) / τ) / Σ_{j=1}^{K} exp(cos(g(t_j), f) / τ)
where t_i denotes the prompt in which [CLASS] is instantiated with the name of category i.
In actual use, the position of [CLASS] in the learnable semantic vector can be at the end, in the middle, or at the beginning. For example, a sentence is first turned into a sequence of uniform length (77 tokens for English, 52 for Chinese) by a word-embedding layer (such as a BPE tokenizer), and the features of that sequence are then extracted by an embedding network (yielding, for example, a 52×512 tensor). In previous methods this vector is fixed. Here, however, few-shot learning is performed, and a handful of samples (e.g., one image, two images, etc.) cannot update the entire network, so the 52×512 vector itself is updated. Thus, when training the image classification model, the parameters of the visual encoder and the text encoder are fixed, and only the learnable semantic vector is updated.
Specifically, a 16×512 portion of the semantic vector is made learnable. There are naturally three ways of selecting these 16 positions out of the 52: the first part, the last part, or the middle part. The idea of ensemble learning is adopted here: the three different positions are fused, and the final result is taken as their average.
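The following is a minimal sketch (module and parameter names are assumptions, not the patent's code) of such a learnable prompt: a 16×512 block of context vectors is a trainable parameter, and the [CLASS] word embeddings can be placed at the end, the beginning, or the middle, so that three variants can be instantiated and their predictions averaged as described above:

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int = 16, dim: int = 512, position: str = "end"):
        super().__init__()
        # [V]_1 ... [V]_M: the only parameters updated during few-shot training.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.position = position

    def forward(self, class_emb: torch.Tensor) -> torch.Tensor:
        # class_emb: (L_cls, dim) word embeddings of the class name, kept fixed.
        if self.position == "end":        # [V]_1 ... [V]_M [CLASS]
            return torch.cat([self.ctx, class_emb], dim=0)
        if self.position == "begin":      # [CLASS] [V]_1 ... [V]_M
            return torch.cat([class_emb, self.ctx], dim=0)
        half = self.ctx.shape[0] // 2     # [V]_1 ... [CLASS] ... [V]_M
        return torch.cat([self.ctx[:half], class_emb, self.ctx[half:]], dim=0)
```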
In the present invention, the training samples, that is, the images to be classified, may be images of various types, such as scenery images or person images; the embodiment of the present invention is not limited in this respect.
The image data, i.e. the objects in the training samples, may be detected by the object detection model.
The Chinese descriptive data is text describing the image data, and corresponds to the image data. For example, the training sample is an image containing apples, the image data is apples, and the Chinese descriptive data may be "red fruits".
In this embodiment, the training samples include pictures, video frames, and the like, which are not limited herein. While the form of the chinese descriptive data includes words, phrases, sentences, paragraph articles, etc., not limited herein.
In an embodiment, the image class representation is located at a beginning position, a middle position, or an ending position of the learnable semantic vector.
Referring to fig. 3, fig. 3 is a flowchart illustrating training of an initial model using a training set according to an exemplary embodiment of the present invention. In fig. 3, training the initial model with a training set includes:
step S310, extracting features of the training sample through a feature extraction layer in the initial model to obtain a first text feature and an image feature;
specifically, the feature extraction of the training sample through the feature extraction layer comprises the following steps:
extracting features of the image data in the training sample through a visual encoder in a feature extraction layer to obtain image features;
extracting the characteristics of Chinese descriptive data in the training sample through a text encoder of the characteristic extraction layer to obtain a first text characteristic;
the text encoder and the visual encoder are obtained by performing contrast learning training on image features and first text features by using a training set formed by an image sample and Chinese descriptive text samples corresponding to the image sample.
In the present invention, the feature extraction layer includes a visual encoder and a text encoder. The visual encoder can be either a residual network (ResNet) or a Transformer model, and the text encoder can be a commonly used Transformer model such as BERT. Performing feature extraction on the training sample through the feature extraction layer means encoding the image data with the residual network ResNet to obtain a high-level feature representation of the image data, i.e., the image features, and encoding the Chinese descriptive data with the Transformer model to obtain a high-level feature representation of the text data, i.e., the first text features.
Specifically, encoding the image data based on the residual network ResNet includes:
preprocessing the image data (i.e., the pictures): setting the input resolution of the pictures, scaling them and then cropping them with a center-cropping method, and normalizing the scaled and cropped pictures; extracting features of different dimensions from the normalized image data to form a feature set; selecting sample points and extracting M-dimensional features from them, so that each sample is characterized by a matrix of size M×N, and augmenting the original image data by random erasing and contrast transformation; splitting the data set into a training set and a test set according to a given ratio, converting both into binary files, adding the sample labels, and feeding the resulting TFRecord files to the ResNet model as input data; training the ResNet model then yields the high-level feature representation of the image data, i.e., the image features.
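A minimal sketch of the image preprocessing and augmentation described above, using torchvision (the exact resolutions, normalization statistics and augmentation probabilities are assumptions):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                  # scale the picture
    transforms.CenterCrop(224),              # center cropping to the input resolution
    transforms.ColorJitter(contrast=0.4),    # contrast transformation for augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
    transforms.RandomErasing(p=0.25),        # random erasing for augmentation
])
```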
Specifically, encoding the Chinese descriptive data based on a Transformer model to obtain the high-level feature representation of the text data includes the following steps:
performing text preprocessing through word segmentation and stop-word removal and processing with a BERT model to obtain a vectorized representation of the text; constructing a description text for each category according to the classification labels of the task, using the encoder of a Transformer model as the feature extractor, and performing feature extraction on the Chinese descriptive data to capture its internal information, thereby obtaining the high-level feature representation of the Chinese descriptive data, i.e., the first text feature.
In the invention, the text encoder and the visual encoder are obtained by performing contrast learning training on the image features and the first text features by using a training set consisting of the image samples and Chinese descriptive text samples corresponding to the image samples.
The core of training a Chinese-adapted text encoder and visual encoder is to collect a large data set of image-text pairs (image samples together with the Chinese descriptive text samples corresponding to them); the data may come from open-source online data sets or from related data collected by the user. The main reason for adopting contrastive learning is to speed up training of the network: a generative approach is relatively difficult to train and the model does not converge easily, whereas the paired training scheme of contrastive learning reduces the training difficulty. Once the Chinese-adapted text-encoder and visual-encoder models are obtained, few-shot learning in Chinese can be performed.
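A minimal sketch of the CLIP-style contrastive objective used to align the two encoders on paired image/Chinese-text data (the batch-level symmetric cross-entropy shown here is an assumption of the exact pre-training loss):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # img_feats, txt_feats: (B, D) features of B matched image/text pairs.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / tau                              # (B, B) pairwise similarities
    labels = torch.arange(img.shape[0], device=img.device)    # the i-th image matches the i-th text
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```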
Step S320, applying, by the context-aware layer in the initial model, a multi-head attention mechanism so that the image feature attends to the first text feature, obtaining a second text feature;
Referring to fig. 4, fig. 4 is a flow chart illustrating the use of a multi-head attention mechanism to make the image feature attend to the first text feature according to an exemplary embodiment of the present invention. In fig. 4, using a multi-head attention mechanism so that the image feature attends to the first text feature includes:
step S410, carrying out global pooling processing on the image features to obtain global features;
step S420, carrying out feature fusion on the image features and the global features to obtain first fusion features;
step S430, inputting the first fusion feature into a multi-head attention network to obtain a second fusion feature;
and S440, carrying out feature fusion on the image features and the second fusion features to obtain second text features.
To better illustrate the context-aware layer, the visual encoder is described in more detail. Without loss of generality, taking the residual network ResNet as an example, the visual encoder passes through a total of four stages during encoding, and the corresponding feature maps are denoted x_1, x_2, x_3, x_4. In CLIP, an additional attention-pooling layer is introduced on top of these. The residual network ResNet is built by stacking a number of repeated modules; each module halves the size of the feature map, the shallow stages extract local features such as textures, and the later stages extract more semantic features. Specifically, after the stage-4 feature x_4 (of size H_4 × W_4 × C) is obtained, the network obtains a global feature through a global pooling layer, where H_4, W_4 and C are respectively the height, width and number of channels of the stage-4 feature map. The image feature and the global feature are then fused to obtain a first fusion feature of size (H_4 · W_4 + 1) × C. This fusion feature is fed into a multi-head attention network, yielding the second fusion feature z.
In the original CLIP, only the feature that has undergone global pooling is used, and the feature z that has not been globally pooled is discarded; however, z is meaningful because it retains better spatial information. Finally, a Transformer decoder is used to fuse the output text feature w obtained from the original text encoder with the second fusion feature [z̄, z], relying at the same time on the idea of residual learning:
w' = w + γ · TransDecoder(w, [z̄, z])
where γ controls the ratio of the residual. In this way a context-aware layer is introduced, which is mainly used to increase the interaction between the visual encoder and the text encoder and to further improve the effect of the text semantics by using the visual semantics.
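A minimal sketch (shapes, module choices and the residual ratio γ are assumptions) of the context-aware layer described above: global pooling, fusion, multi-head attention over the fused visual tokens, and a Transformer decoder that lets the text features attend to them with a residual connection:

```python
import torch
import torch.nn as nn

class ContextAwareLayer(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, gamma: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.gamma = gamma  # ratio of the residual

    def forward(self, x4: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # x4: (B, H4*W4, C) flattened stage-4 feature map; w: (B, K, C) text features.
        g = x4.mean(dim=1, keepdim=True)           # global pooling -> (B, 1, C)
        fused = torch.cat([g, x4], dim=1)          # first fusion feature, (B, 1 + H4*W4, C)
        z, _ = self.attn(fused, fused, fused)      # second fusion feature via multi-head attention
        # Text attends to the visual tokens; the residual keeps the original text feature dominant.
        return w + self.gamma * self.decoder(tgt=w, memory=z)
```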
Step S330, calculating the similarity between the second text feature and the image feature through a similarity measurement layer in the initial model;
The Chinese descriptive data is passed through the text encoder to obtain a plurality of text features, and a similarity is computed between each text feature and the image feature; specifically, this similarity can be represented by the cosine similarity.
And step S340, constructing a loss function based on the similarity, and carrying out iterative training on the initial model according to the loss function to obtain an image classification model.
In an embodiment, the loss function comprises a cross entropy loss function and a divergence loss function, wherein the cross entropy loss function is used to constrain the first prediction classification value and the true classification value, and the divergence loss function is used to constrain the first prediction classification value and the second prediction classification value, the second prediction classification value being a zero sample prediction classification value.
The loss function plays a vital role in the training of the network. The loss function here consists of two parts: a cross-entropy loss that constrains the first prediction classification value against the true classification value, and a KL loss that constrains the first prediction classification value against the second prediction classification value. In this method, the network mainly updates the learnable semantic vector t. First, the cross-entropy loss function is defined:
L_ce = - Σ_i y_i · log p(t_i | x)
where y denotes the one-hot encoding of the true label.
Then, the CLIP zero-shot inference result p_zs(w_i | x) is introduced, and the KL (Kullback-Leibler) divergence loss between p(t_i | x) and p_zs(w_i | x) is computed:
L_kl = Σ_i p(t_i | x) · log [ p(t_i | x) / p_zs(w_i | x) ]
Finally, the two loss terms are combined into a single objective to be optimized:
L = L_ce + L_kl
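A minimal sketch (an assumption of the exact implementation, including the direction of the KL term) of the combined objective above, written against per-sample logits:

```python
import torch

def total_loss(logits_prompt: torch.Tensor,     # (B, K) logits from the learnable prompts
               logits_zero_shot: torch.Tensor,  # (B, K) logits from the fixed zero-shot prompts
               target: torch.Tensor) -> torch.Tensor:  # (B,) integer class labels
    ce = torch.nn.functional.cross_entropy(logits_prompt, target)
    log_p = logits_prompt.log_softmax(dim=-1)        # log p(t_i | x)
    log_q = logits_zero_shot.log_softmax(dim=-1)     # log p_zs(w_i | x)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()  # KL(p || p_zs), batch-averaged
    return ce + kl
```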
it should be understood that the sequence number of each step in the foregoing embodiment does not mean that the execution sequence of each process should be determined by the function and the internal logic, and should not limit the implementation process of the embodiment of the present invention.
In summary, the present invention mainly uses CLIP as the basic algorithmic framework and adopts a two-tower architecture, namely a visual encoder and a text encoder. The input consists of two parts: images and text. The visual encoder may be a residual network (ResNet) or a Transformer structure, and the text encoder is commonly a Transformer architecture such as BERT. The earliest CLIP work simply used the image category information and then applied a prompt learning method to turn the category into a sentence, such as "a photo of {class_name}". When zero-shot inference (i.e., no samples are involved in training) is not effective enough, the model needs to be optimized for a particular scenario with a small number of samples (e.g., 1 sample, 2 samples, etc.).
Under different data sets, using a rigid, static prompt-learning template is not necessarily the best choice. Therefore, a "learnable semantic vector" is introduced, which is continuous and learnable. The learnable semantic vectors, together with the original categories of the images, form the input of the text encoder. At the same time, a context-aware layer is also proposed, whose role is mainly to use visual semantics to further improve the effect of the text semantics; it simply uses the decoder part of the Transformer architecture. After the image feature and a series of text features are obtained, an L2 normalization operation is performed, and then the similarity between the image feature and the text features is calculated. After the predicted values are obtained, two loss functions are used to optimize the model: a cross-entropy loss constrains the predicted value against the true value, and a KL loss constrains the predicted value against the zero-shot predicted value. The two losses form a total loss function, the initial model is then trained iteratively according to this loss function, and the image classification model is finally obtained.
According to the invention, a learnable semantic vector is added to the Chinese descriptive text samples in the training set, and the performance of the image classification model is improved by learning this vector, so that images classified with the model are classified more accurately.
FIG. 5 is a block diagram of a training apparatus for an image classification model, according to an exemplary embodiment of the present invention. The device can be applied to the implementation environment shown in fig. 1, and is specifically configured in a server or a terminal device. The training device of the image classification model may also be suitable for other exemplary implementation environments, and is specifically configured in other devices, and the embodiment does not limit the implementation environments to which the training device of the image classification model is suitable.
As shown in fig. 5, the present invention further provides a training device for an image classification model, where the device includes:
a data acquisition module 510, configured to acquire a training set composed of a plurality of training samples, where each training sample includes image data and Chinese descriptive data corresponding to the image data; the Chinese descriptive data includes a learnable semantic vector having an image category representation;
the training module 520 is configured to train the initial model with the training set based on few-shot learning, so as to obtain an image classification model.
It should be noted that the training device for the image classification model provided in the above embodiment and the training method for the image classification model provided in the above embodiment belong to the same concept; the specific manner in which each module and unit performs its operation has been described in detail in the embodiment of the training method and is not repeated here. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above, which is not limited herein.
Referring to fig. 6, fig. 6 is a flowchart illustrating an image classification method according to an exemplary embodiment of the application, the classification method includes:
step S610, obtaining an image to be classified;
in step S620, the image to be classified is input into the image classification model obtained by training in the method shown in fig. 2, and the output of the image classification model is used as the class of the image to be classified.
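As an illustration of these two steps, the following is a minimal sketch (the preprocessing helper, model interface and class-name list are assumptions) of running the trained image classification model on one image:

```python
import torch
from PIL import Image

def classify(image_path: str, model, preprocess, class_names):
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)  # image to be classified
    with torch.no_grad():
        probs = model(image)                        # (1, K) class probabilities
    return class_names[int(probs.argmax(dim=-1))]   # output taken as the class of the image
```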
Referring to fig. 7, fig. 7 is a block diagram of an image classification apparatus according to an exemplary embodiment of the present application, the classification apparatus includes:
an image acquisition module 710, configured to acquire an image to be classified;
the image classification module 720 is configured to input the image to be classified into an image classification model, and take an output of the image classification model as a class of the image to be classified.
The embodiment of the application also provides a device, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the training method of the image classification model of fig. 2. In practical applications, the device may be used as a terminal device or as a server. Examples of the terminal device include: smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car computers, desktop computers, set-top boxes, smart televisions, wearable devices, etc.; embodiments of the present application are not limited to specific devices.
The embodiment of the application also provides a non-volatile readable storage medium in which one or more modules (programs) are stored; when the one or more modules are applied to a device, they can cause the device to execute the instructions of the steps included in the training method of the image classification model in fig. 2 according to the embodiment of the application.
Fig. 8 is a schematic hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103 and at least one communication bus 1104. The communication bus 1104 is used to enable communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may further include a nonvolatile memory NVM, such as at least one magnetic disk memory, where various programs may be stored in the first memory 1103 for performing various processing functions and implementing the training method steps of the image classification model of the present embodiment.
Alternatively, the first processor 1101 may be implemented as, for example, a central processing unit (Central Processing Unit, abbreviated as CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Alternatively, the input device 1100 may include a variety of input devices, for example, may include at least one of a user-oriented user interface, a device-oriented device interface, a programmable interface of software, a camera, and a sensor. Optionally, the device interface facing to the device may be a wired interface for data transmission between devices, or may be a hardware insertion interface (such as a USB interface, a serial port, etc.) for data transmission between devices; alternatively, the user-oriented user interface may be, for example, a user-oriented control key, a voice input device for receiving voice input, and a touch-sensitive device (e.g., a touch screen, a touch pad, etc. having touch-sensitive functionality) for receiving user touch input by a user; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, for example, an input pin interface or an input interface of a chip, etc.; the output device 1102 may include a display, sound, or the like.
In this embodiment, the processor of the terminal device may include a function for executing each module in each device, and specific functions and technical effects may be referred to the above embodiments and are not described herein again.
Fig. 9 is a schematic hardware structure of a terminal device according to an embodiment of the present application. Fig. 9 is a specific embodiment of the implementation of fig. 8. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the steps of the training method of the image classification model of fig. 2 in the above-described embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, video, etc. The second memory 1202 may include a random access memory (random access memory, simply RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: a communication component 1203, a power component 1204, a multimedia component 1205, a voice component 1206, an input/output interface 1207, and/or a sensor component 1208. The components and the like specifically included in the terminal device are set according to actual requirements, which are not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps in the training method for image classification models described above. Further, the processing component 1200 may include one or more modules that facilitate interactions between the processing component 1200 and other components. For example, the processing component 1200 may include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power supply component 1204 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia component 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received voice signals may be further stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the voice component 1206 further includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing assembly 1200 and peripheral interface modules, which may be click wheels, buttons, and the like. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing status assessments of various aspects of the terminal device. For example, the sensor component 1208 may detect the on/off state of the terminal device, the relative positioning of components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device can log onto a GPRS network and establish communication with a server via the Internet.
As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207, and the sensor component 1208 in the embodiment of Fig. 9 can be implemented as the input device in the embodiment of Fig. 7.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.
Claims (12)
1. A method for training an image classification model, the method comprising:
acquiring a training set formed by a plurality of training samples, wherein each training sample comprises image data and Chinese descriptive data corresponding to the image data, and the Chinese descriptive data comprises a learnable semantic vector having an image category representation; and
training an initial model using the training set based on small-sample learning, to obtain an image classification model.
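For orientation only, the following sketch (Python/PyTorch, not part of the patent; names such as LearnablePrompt, n_context, and class_position are hypothetical) shows one way the Chinese descriptive data of claim 1 could be represented as a learnable semantic vector carrying an image category representation, with the category embedding placed at the start, middle, or end of the context (see claim 5):

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Hypothetical sketch: learnable context vectors plus a fixed class-name embedding."""

    def __init__(self, class_token_embedding: torch.Tensor, n_context: int = 16,
                 dim: int = 512, class_position: str = "end"):
        super().__init__()
        # Learnable semantic vectors (the context to be optimized during training).
        self.context = nn.Parameter(torch.randn(n_context, dim) * 0.02)
        # Fixed embedding of the class name, e.g. the tokenized Chinese word for the category.
        self.register_buffer("class_token", class_token_embedding)  # (n_class_tokens, dim)
        self.class_position = class_position  # "start", "middle", or "end"

    def forward(self) -> torch.Tensor:
        ctx, cls = self.context, self.class_token
        if self.class_position == "start":
            return torch.cat([cls, ctx], dim=0)
        if self.class_position == "middle":
            half = ctx.shape[0] // 2
            return torch.cat([ctx[:half], cls, ctx[half:]], dim=0)
        return torch.cat([ctx, cls], dim=0)
```

Under the small-sample setting, each training sample would pair an image with the prompt of its ground-truth category, with only a few labeled images assumed per class.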
2. The method for training an image classification model according to claim 1, wherein training the initial model using the training set comprises:
extracting features of the training sample through a feature extraction layer in the initial model to obtain first text features and image features;
enabling, by a context-aware layer in the initial model, the image features to add attention to the first text features using a multi-head attention mechanism, to obtain second text features;
calculating the similarity between the second text features and the image features through a similarity measurement layer in the initial model;
and constructing a loss function based on the similarity, and performing iterative training on the initial model according to the loss function to obtain an image classification model.
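A compressed, non-authoritative sketch of the training step described in claim 2, assuming PyTorch; the component names visual_encoder, text_encoder, and context_aware_layer, the cosine-similarity logits, and the temperature value are assumptions rather than details taken from the patent:

```python
import torch
import torch.nn.functional as F

def training_step(images, class_prompts, labels, visual_encoder, text_encoder,
                  context_aware_layer, optimizer, temperature: float = 0.07):
    # Feature extraction layer: image features and first text features.
    image_feat = visual_encoder(images)               # (B, N, D) spatial image features
    first_text_feat = text_encoder(class_prompts)     # (C, D), one embedding per class prompt

    # Context-aware layer: the image features add attention to the first text
    # features via multi-head attention, yielding second text features.
    second_text_feat = context_aware_layer(image_feat, first_text_feat)  # (B, C, D)

    # Similarity measurement layer: cosine similarity between image and second text features.
    pooled = F.normalize(image_feat.mean(dim=1), dim=-1)                 # (B, D)
    second_text_feat = F.normalize(second_text_feat, dim=-1)
    logits = torch.einsum("bd,bcd->bc", pooled, second_text_feat) / temperature

    # Loss built on the similarity, followed by one step of iterative training.
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```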
3. The method for training an image classification model according to claim 2, wherein the feature extraction of the training sample by the feature extraction layer comprises:
extracting features of the image data in the training sample through a visual encoder in the feature extraction layer to obtain the image features;
extracting features of the Chinese descriptive data in the training sample through a text encoder in the feature extraction layer to obtain the first text features;
wherein the text encoder and the visual encoder are obtained by contrastive learning on image features and text features, using a training set formed by image samples and Chinese descriptive text samples corresponding to the image samples.
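Claim 3 only requires that the two encoders come from contrastive learning on image/Chinese-text pairs. Below is a minimal sketch of one common form of such an objective, a symmetric InfoNCE-style loss as popularized by CLIP; this exact formulation is an assumption, not a detail of the patent:

```python
import torch
import torch.nn.functional as F

def contrastive_pretraining_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-to-text / text-to-image contrastive loss over a batch of matched pairs."""
    image_feats = F.normalize(image_feats, dim=-1)        # (B, D)
    text_feats = F.normalize(text_feats, dim=-1)          # (B, D)
    logits = image_feats @ text_feats.t() / temperature   # (B, B); diagonal entries are positives
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2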
4. The method for training an image classification model according to claim 3, wherein using the multi-head attention mechanism to enable the image features to add attention to the first text features to obtain the second text features comprises:
performing global pooling on the image features to obtain global features;
performing feature fusion on the image features and the global features to obtain first fusion features;
inputting the first fusion features into a multi-head attention network to obtain second fusion features; and
performing feature fusion on the image features and the second fusion features to obtain the second text features.
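One possible reading of the four steps of claim 4, sketched as a PyTorch module; using addition for "feature fusion", mean pooling for "global pooling", and interpreting the fused result as an image-conditioned shift applied to the first text features are all assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class ContextAwareLayer(nn.Module):
    """Hypothetical sketch of the context-aware layer of claim 4."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feat: torch.Tensor, first_text_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, N, D) spatial image features; first_text_feat: (C, D) class prompts.
        global_feat = image_feat.mean(dim=1, keepdim=True)        # global pooling -> global features
        first_fusion = image_feat + global_feat                   # fuse image and global features
        second_fusion, _ = self.attn(first_fusion, first_fusion, first_fusion)  # multi-head attention network
        image_bias = (image_feat + second_fusion).mean(dim=1)     # fuse again, pooled to (B, D)
        # The image-conditioned vector shifts every class prompt, so the resulting
        # "second text features" depend on both the image and the first text features.
        return first_text_feat.unsqueeze(0) + image_bias.unsqueeze(1)  # (B, C, D)
```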
5. The method for training an image classification model according to claim 1, wherein the image category representation is located at a beginning position, a middle position, or an end position of the learnable semantic vector.
6. The method for training an image classification model according to claim 2, wherein the loss function comprises a cross-entropy loss function and a divergence loss function, wherein the cross-entropy loss function is used to constrain a first predicted classification value against a true classification value, and the divergence loss function is used to constrain the first predicted classification value against a second predicted classification value, the second predicted classification value being a zero-shot predicted classification value.
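Claim 6 combines a cross-entropy term with a divergence term that ties the model's predictions to zero-shot predictions. Below is a sketch of such a combined loss, assuming the divergence is a KL divergence and that the zero-shot logits come from the frozen pretrained encoders with a fixed, hand-written prompt (both are assumptions):

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, zero_shot_logits: torch.Tensor,
                  labels: torch.Tensor, divergence_weight: float = 1.0) -> torch.Tensor:
    # Cross entropy: constrain the first predicted classification values against the true labels.
    ce = F.cross_entropy(logits, labels)
    # Divergence: constrain the first predictions against the zero-shot (second) predictions.
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(zero_shot_logits, dim=-1),
                  reduction="batchmean")
    return ce + divergence_weight * kl
```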
7. The method for training an image classification model according to claim 3, wherein, during training of the image classification model, the parameters of the visual encoder and the text encoder are fixed and the learnable semantic vector is updated.
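In a PyTorch-style sketch, the parameter freezing of claim 7 could look like the following; the module and attribute names reuse the hypothetical ones from the sketches above and are not taken from the patent:

```python
import torch

def build_prompt_optimizer(visual_encoder, text_encoder, prompt, lr: float = 2e-3):
    # Fix the parameters of the visual encoder and the text encoder.
    for module in (visual_encoder, text_encoder):
        for p in module.parameters():
            p.requires_grad_(False)
    # Only the learnable semantic vectors (the prompt context) are updated by the optimizer.
    return torch.optim.SGD([prompt.context], lr=lr)
```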
8. A training device for an image classification model, the training device comprising:
the data acquisition module is used for acquiring a training set formed by a plurality of training samples, wherein each training sample comprises image data and Chinese descriptive data corresponding to the image data; the Chinese descriptive data includes a learnable semantic vector having an image category representation;
and the training module is used for training an initial model using the training set based on small-sample learning to obtain an image classification model.
9. An image classification method, characterized in that the classification method comprises:
acquiring an image to be classified;
inputting the image to be classified into the image classification model obtained by the training method according to any one of claims 1-7, and taking the output of the image classification model as the class of the image to be classified.
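A rough inference sketch matching claims 9 and 10, again reusing the hypothetical components from the sketches above; the class whose image-conditioned text features are most similar to the image features is returned:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(image, class_names, prompts, visual_encoder, text_encoder, context_aware_layer):
    image_feat = visual_encoder(image.unsqueeze(0))                        # (1, N, D)
    first_text_feat = text_encoder(torch.stack([p() for p in prompts]))    # (C, D)
    second_text_feat = context_aware_layer(image_feat, first_text_feat)    # (1, C, D)
    pooled = F.normalize(image_feat.mean(dim=1), dim=-1)                   # (1, D)
    sims = torch.einsum("bd,bcd->bc", pooled, F.normalize(second_text_feat, dim=-1))
    return class_names[sims.argmax(dim=-1).item()]                         # predicted category
```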
10. An image classification apparatus, characterized in that the classification apparatus comprises:
the image acquisition module is used for acquiring an image to be classified;
the image classification module is used for inputting the image to be classified into the image classification model obtained by the training method according to any one of claims 1-7, and taking the output of the image classification model as the class of the image to be classified.
11. An electronic device, comprising:
one or more processors; and
one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the method for training an image classification model according to any one of claims 1-7 or the image classification method according to claim 9.
12. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method for training an image classification model according to any one of claims 1-7 or the image classification method according to claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310818502.4A CN116740505A (en) | 2023-07-05 | 2023-07-05 | Training of image classification model, image classification method, device, machine-readable medium and machine-readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310818502.4A CN116740505A (en) | 2023-07-05 | 2023-07-05 | Training of image classification model, image classification method, device, machine-readable medium and machine-readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116740505A true CN116740505A (en) | 2023-09-12 |
Family
ID=87915055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310818502.4A Pending CN116740505A (en) | 2023-07-05 | 2023-07-05 | Training of image classification model, image classification method, device, machine-readable medium and machine-readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116740505A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117689974A (en) * | 2023-11-14 | 2024-03-12 | 荣耀终端有限公司 | Training method for image classification model, electronic equipment and readable storage medium |
CN118314148A (en) * | 2024-06-12 | 2024-07-09 | 苏州元脑智能科技有限公司 | Text-guided image detection method, system, device, medium and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||