
CN115049880B - Image classification network generation method, device, equipment and medium - Google Patents


Info

Publication number
CN115049880B
Authority
CN
China
Prior art keywords
training
image
output result
classification network
round
Prior art date
Legal status
Active
Application number
CN202210765740.9A
Other languages
Chinese (zh)
Other versions
CN115049880A
Inventor
郑明凯
游山
王飞
钱晨
Current Assignee
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Lingang Intelligent Technology Co Ltd
Priority to CN202210765740.9A
Publication of CN115049880A
Application granted
Publication of CN115049880B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method, an apparatus, an electronic device, and a storage medium for generating an image classification network. The method includes: acquiring an image sample set for the current round of training; inputting the image sample set into a first image classification network and a second image classification network respectively, to obtain a first output result from the first image classification network and a second output result from the second image classification network; determining a first loss of a preset first loss function based on the first output result and the real class label of each sample image; determining a second loss of a preset second loss function based on the first output result of the current round and historical second output results; adjusting the parameters of the first image classification network based on the first loss and the second loss; and repeating the training for multiple rounds to obtain a trained first image classification network. According to the embodiments of the present application, the training precision of the first image classification network can be improved.

Description

Image classification network generation method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a method for generating an image classification network, an image classification method, an apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, deep learning models have gradually been applied in a wide range of fields. Among them, image classification is regarded as the most basic task in computer vision, and various downstream tasks, such as object detection, video analysis, and semantic segmentation, are generally performed based on image classification results. Therefore, improving image classification accuracy is important.
Disclosure of Invention
The embodiments of the present disclosure provide at least a method for generating an image classification network, an image classification method, corresponding apparatuses, an electronic device, and a storage medium.
An embodiment of the present disclosure provides a method for generating an image classification network, comprising the following steps:
In each round of training, acquiring an image sample set for the current round; the image sample set comprises at least two sample images, each labeled with its real class label;
Inputting the image sample set into a first image classification network and a second image classification network respectively, to obtain, for the current round, a first output result from the first image classification network and a second output result from the second image classification network, where the second image classification network is derived from the first image classification network after parameter adjustment in at least one completed round of training;
Determining a first loss of a preset first loss function based on the first output result of the current round and the real class label corresponding to each sample image;
Determining a second loss of a preset second loss function based on the first output result of the current round and historical second output results, where the second loss function characterizes the distance between the first output result and, among the historical second output results, those of sample images whose similarity exceeds a preset threshold; the historical second output results comprise second output results obtained in at least one historical round of training before the current round;
Adjusting the parameters of the first image classification network based on the first loss and the second loss; and
Repeating the training for multiple rounds to obtain a trained first image classification network.
In the embodiments of the present disclosure, because the network parameters are adjusted according to the first loss function and the second loss function simultaneously, that is, the network is trained by combining the traditional class-center approach with contrastive learning, the distance between the features of same-class image samples can be reduced, and among those samples the more similar ones are drawn even closer, so that same-class features become more compact, which in turn helps improve the image classification accuracy of the first image classification network.
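The patent gives no formulas for the two losses, so the following is only a minimal numpy sketch, not the claimed implementation. It assumes the first loss is a standard cross-entropy against the real class labels, and it approximates the second loss as a supervised contrastive term in which "similarity above a preset threshold" is stood in for by shared class labels between current first features and cached historical second features; the function names and the temperature `tau` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def first_loss(pred_probs, true_labels):
    # Cross-entropy between the first prediction probability
    # distribution and the real class labels.
    n = pred_probs.shape[0]
    return -np.mean(np.log(pred_probs[np.arange(n), true_labels] + 1e-12))

def second_loss(first_feats, hist_feats, hist_labels, labels, tau=0.1):
    # Contrastive-style loss pulling first features toward historical
    # second features of "similar" samples (approximated here by shared
    # labels; the patent instead uses a similarity threshold).
    f = first_feats / np.linalg.norm(first_feats, axis=1, keepdims=True)
    h = hist_feats / np.linalg.norm(hist_feats, axis=1, keepdims=True)
    sim = f @ h.T / tau                                   # (batch, memory)
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == hist_labels[None, :]).astype(float)
    return -np.mean((logp * pos).sum(1) / np.maximum(pos.sum(1), 1))
```

Both losses would then be combined (e.g., summed) before back-propagating through the first network only.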
In one possible implementation, before adjusting the parameters of the first image classification network based on the first loss and the second loss, the method further includes:
generating a virtual class label for the image sample set of the current round based on the second output result of the current round and the historical second output results; and
determining a third loss of a preset third loss function based on the difference between the first output result of the current round and the virtual class label.
The adjusting of the parameters of the first image classification network based on the first loss and the second loss then includes:
adjusting the parameters of the first image classification network based on a weighted sum of the first loss, the second loss, and the third loss.
In the embodiments of the present disclosure, because the first image classification network is constrained by three loss functions, that is, in addition to the first and second loss functions, the third loss function constrains the first output result against the virtual class label, the image classification accuracy of the first image classification network can be further improved.
In a possible embodiment, the second image classification network is obtained by taking a moving average of the first image classification network's parameters as adjusted in the at least one completed round of training.
In the embodiments of the present disclosure, the second image classification network is updated by slowly accumulating the parameters of the first image classification network over iterations, which keeps the historical second output results timely, improves the accuracy with which the second loss is determined, and thus helps improve the classification accuracy of the first image classification network.
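As a concrete reading of this moving-average update, the second network's parameters can track the first network's parameters with a momentum coefficient after each round; the coefficient value and the dict-of-floats parameter representation below are illustrative assumptions, not values given in the patent.

```python
def ema_update(second_params, first_params, momentum=0.999):
    """Moving-average update: the second network's parameters slowly
    track the first (parameter-adjusted) network's parameters.
    A momentum close to 1 means slow accumulation, as described above."""
    return {name: momentum * second_params[name]
                  + (1.0 - momentum) * first_params[name]
            for name in second_params}
```

In a deep-learning framework the same update would be applied tensor-by-tensor over the two networks' state dictionaries, with gradients disabled for the second network.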
In a possible implementation, the first output result of the current round comprises a first feature of each sample image and a first prediction probability distribution of each sample image, where the first prediction probability distribution characterizes the probability that the sample image belongs to each of a preset plurality of classes.
The determining of the first loss of the preset first loss function based on the first output result of the current round and the real class label corresponding to each sample image then includes:
determining the first loss of the first loss function based on the first prediction probability distribution of each sample image of the current round and the real class label corresponding to each sample image of the current round.
In this embodiment, the first loss is determined from the first prediction probability distribution of each sample image and its real class label, which improves the accuracy with which the first loss is determined and thus the image classification accuracy of the first image classification network.
In a possible implementation, the historical second output results comprise a second feature of each sample image acquired in at least one historical round of training.
The determining of the second loss of the preset second loss function based on the first output result of the current round and the historical second output results then includes:
determining the second loss of the second loss function based on the first feature of each sample image of the current round and the second feature of each sample image acquired in the at least one historical round of training.
In this embodiment, the second loss is determined from the first feature of each sample image and the second features acquired in at least one historical round of training, which improves the accuracy with which the second loss is determined and thus the image classification accuracy of the first image classification network.
In a possible implementation, the first output result of the current round comprises a first prediction probability distribution of each sample image of the current round, the second output result of the current round comprises a second feature of each sample image of the current round, and the historical second output results comprise, for at least one historical round of training, the second feature of each sample image output by the second image classification network and a second prediction probability distribution of each sample image, where the second prediction probability distribution characterizes the probability that the sample image belongs to each of a preset plurality of classes.
The generating of the virtual class label for the image sample set of the current round based on the second output result of the current round and the historical second output results then includes:
for each sample image of the current round, determining a similarity distribution of its second feature relative to the second features of the sample images in the historical second output results; and
weighting the second prediction probability distributions of the sample images in the historical second output results by the similarity distribution to obtain the virtual class label of that image sample.
The determining of the third loss of the preset third loss function based on the difference between the first output result of the current round and the virtual class label includes:
determining the third loss of the third loss function based on the virtual class label of each sample image of the current round and the first prediction probability distribution of each sample image of the current round.
In the embodiments of the present disclosure, during the training of the first image classification network, the third loss function additionally constrains similar image samples to predict similar probability distributions, which makes the features of similar samples more compact and improves the image classification capability of the first image classification network.
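A hedged sketch of this virtual-label construction: each current second feature is compared with the cached historical second features, the resulting similarity scores are normalized into a distribution (a softmax is assumed here; the patent only says "similarity distribution"), and that distribution weights the historical second prediction distributions into a soft virtual label; the third loss is then a soft cross-entropy against the first prediction distribution. All names and the temperature `tau` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def virtual_labels(second_feats, hist_feats, hist_probs, tau=0.1):
    # Similarity distribution of each current second feature over the
    # historical second features, used to weight the historical second
    # prediction distributions into a soft "virtual" class label.
    f = second_feats / np.linalg.norm(second_feats, axis=1, keepdims=True)
    h = hist_feats / np.linalg.norm(hist_feats, axis=1, keepdims=True)
    w = softmax(f @ h.T / tau, axis=1)        # (batch, memory)
    return w @ hist_probs                     # (batch, num_classes)

def third_loss(first_probs, vlabels):
    # Soft cross-entropy between the virtual labels and the first
    # network's prediction probability distributions.
    return -np.mean((vlabels * np.log(first_probs + 1e-12)).sum(1))
```

Because each weighting row and each historical distribution sums to 1, every virtual label is itself a valid probability distribution.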
In one possible embodiment, before determining the second loss of the preset second loss function based on the first output result of the current round and the historical second output results, the method further includes:
reading the historical second output results from a cache, where the number of second features contained in the cached historical second output results is a fixed value. This both preserves the classification accuracy of the first image classification network and improves the classification efficiency of the network.
In a possible implementation, the second output result obtained in the current round is added to the cache when the number of second features contained in the cache is smaller than the fixed value; or,
when the number of second features contained in the cache is not smaller than the fixed value, the historical second output result that entered the cache earliest is deleted, and the second output result obtained in the current round is added to the cache. In this way, the number of second features in the historical second output results is kept at the fixed value while remaining timely, which further improves the image classification accuracy of the first image classification network.
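The fixed-size cache with earliest-entry eviction described above behaves like a FIFO queue; a minimal sketch follows, where the class name and the granularity of what is stored per entry (one second output per round) are assumptions for illustration.

```python
from collections import deque

class FeatureCache:
    """Fixed-capacity cache of historical second outputs. When the
    cache is full, the earliest entry is evicted before the current
    round's output is added, keeping the count at a fixed value."""
    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest item on overflow
        self.buf = deque(maxlen=capacity)

    def add(self, second_output):
        self.buf.append(second_output)

    def read(self):
        # Historical second outputs, oldest first
        return list(self.buf)
```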
The embodiment of the disclosure provides an image classification method, which comprises the following steps:
acquiring an image to be classified; and
inputting the image to be classified into an image classification network to obtain a classification result, where the image classification network is obtained by the method of any of the preceding embodiments.
The embodiment of the disclosure provides a device for generating an image classification network, which comprises:
a sample acquisition module, configured to acquire, in each round of training, an image sample set for the current round; the image sample set comprises at least two sample images, each labeled with its real class label;
a result detection module, configured to input the image sample set into a first image classification network and a second image classification network respectively, to obtain, for the current round, a first output result from the first image classification network and a second output result from the second image classification network, where the second image classification network is derived from the first image classification network after parameter adjustment in at least one completed round of training;
a first determining module, configured to determine a first loss of a preset first loss function based on the first output result of the current round and the real class label corresponding to each sample image;
a second determining module, configured to determine a second loss of a preset second loss function based on the first output result of the current round and historical second output results, where the second loss function characterizes the distance between the first output result and, among the historical second output results, those of sample images whose similarity exceeds a preset threshold; the historical second output results comprise second output results obtained in at least one historical round of training before the current round;
a parameter adjustment module, configured to adjust the parameters of the first image classification network based on the first loss and the second loss; and
an iterative training module, configured to repeat the training for multiple rounds to obtain a trained first image classification network.
In a possible implementation, the apparatus further includes a label generation module and a third determining module, where the label generation module is configured to:
generate a virtual class label for the image sample set of the current round based on the second output result of the current round and the historical second output results;
the third determining module is configured to:
determine a third loss of a preset third loss function based on the difference between the first output result of the current round and the virtual class label; and
the parameter adjustment module is specifically configured to:
adjust the parameters of the first image classification network based on a weighted sum of the first loss, the second loss, and the third loss.
In a possible embodiment, the second image classification network is obtained by taking a moving average of the first image classification network's parameters as adjusted in the at least one completed round of training.
In a possible implementation, the first output result of the current round comprises a first feature of each sample image and a first prediction probability distribution of each sample image, where the first prediction probability distribution characterizes the probability that the sample image belongs to each of a preset plurality of classes; the first determining module is specifically configured to:
determine the first loss of the first loss function based on the first prediction probability distribution of each sample image of the current round and the real class label corresponding to each sample image of the current round.
In a possible implementation, the historical second output results comprise a second feature of each sample image acquired in at least one historical round of training; the second determining module is specifically configured to:
determine the second loss of the second loss function based on the first feature of each sample image of the current round and the second feature of each sample image acquired in the at least one historical round of training.
In a possible implementation, the first output result of the current round comprises a first prediction probability distribution of each sample image of the current round, the second output result of the current round comprises a second feature of each sample image of the current round, and the historical second output results comprise, for at least one historical round of training, the second feature of each sample image output by the second image classification network and a second prediction probability distribution of each sample image, where the second prediction probability distribution characterizes the probability that the sample image belongs to each of a preset plurality of classes; the label generation module is specifically configured to:
for each sample image of the current round, determine a similarity distribution of its second feature relative to the second features of the sample images in the historical second output results; and
weight the second prediction probability distributions of the sample images in the historical second output results by the similarity distribution to obtain the virtual class label of that image sample;
the third determining module is specifically configured to:
determine the third loss of the third loss function based on the virtual class label of each sample image of the current round and the first prediction probability distribution of each sample image of the current round.
In one possible implementation, the apparatus further includes a result access module configured to:
read the historical second output results from a cache, where the number of second features contained in the cached historical second output results is a fixed value.
In one possible implementation, the result access module is further configured to:
add the second output result obtained in the current round to the cache when the number of second features contained in the cache is smaller than the fixed value; or,
when the number of second features contained in the cache is not smaller than the fixed value, delete the historical second output result that entered the cache earliest and add the second output result obtained in the current round to the cache.
The embodiment of the disclosure provides an image classification device, which comprises:
The image acquisition module is used for acquiring images to be classified;
The image classification module is used for inputting the images to be classified into an image classification network to obtain classification results; the image classification network is obtained by the method of any of the preceding embodiments.
An embodiment of the present disclosure provides an electronic device, comprising: a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the memory over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the method of generating an image classification network as described in any of the preceding embodiments, or the steps of the image classification method described above.
The presently disclosed embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of generating an image classification network as described in the first aspect and any of the preceding embodiments, or the steps of the image classification method as described above.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required by the embodiments are briefly described below. The drawings are incorporated in and constitute a part of the specification; they show embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings show only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other related drawings from them without inventive effort.
FIG. 1 illustrates a flow chart of a method of generating an image classification network provided by embodiments of the present disclosure;
FIG. 2 is a schematic diagram of an image sample approximation process provided by an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of another method of generating an image classification network provided by embodiments of the present disclosure;
FIG. 4 illustrates a flow chart of a method of image classification provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an apparatus for generating an image classification network according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of another apparatus for generating an image classification network according to an embodiment of the disclosure;
FIG. 7 is a schematic structural diagram of an image classification apparatus according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of an electronic device provided by an embodiment of the disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B both exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
First, the technical terms involved in the embodiments of the present application are described and explained:
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize, track, and measure targets, and further performs graphics processing so that the result is an image better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Among these, image classification is considered the most basic task in computer vision, and various downstream tasks such as object detection, video analysis, and semantic segmentation are generally performed on the basis of image classification results. How to improve image classification accuracy is therefore important.
Research shows that a traditional image classification model aims to "pull" the features of same-class image samples toward their corresponding class center vector, i.e., reduce the distance between the features of same-class image samples and the corresponding class center vector, and to "push" image sample features away from the class center vectors of other classes, i.e., increase the distance between a sample's features and the center vectors of different classes. In addition, contrastive learning has been very successful in unsupervised learning, and in the related art it has also been applied to supervised learning; it differs from traditional image classification in that it directly pulls together the features of same-class image samples and pushes apart the features of different-class samples, without using class center vectors. These methods can improve the classification accuracy of an image classification model, but the effect is still not ideal.
Based on the above study, the present disclosure provides a method of generating an image classification network, which may include the steps of:
In each round of training, acquiring an image sample set of the round of training; the image sample set comprises at least two sample images, and each sample image is marked with a real category label of each sample image;
Inputting the image sample set into a first image classification network and a second image classification network respectively, to obtain a first output result output by the first image classification network and a second output result output by the second image classification network in the current round of training, wherein the second image classification network is obtained based on the parameter-adjusted first image classification network from at least one completed round of training;
Determining a first loss of a preset first loss function based on a first output result obtained by the training and a real class label corresponding to each sample image;
Determining a second loss of a preset second loss function based on the first output result and the historical second output result of the round of training, wherein the second loss function represents the distance between the first output result and a sample image with similarity larger than a preset threshold value in the historical second output result; the historical second output result comprises a second output result obtained in at least one round of historical training before the round of training;
Performing parameter adjustment on the first image classification network based on the first loss and the second loss;
repeating the training for a plurality of times to obtain a first image classification network with the training completed.
According to the method for generating an image classification network described above, the network parameters are adjusted according to both the first loss function and the second loss function, i.e., network training combines the traditional class-center approach with contrastive learning. This not only shortens the distance between same-class image samples and their class centers, but also pulls the most similar same-class samples even closer together, making same-class features more compact and thereby improving the image classification accuracy of the image classification network.
The method for generating the image classification network according to the embodiment of the present disclosure is described in detail below, and the method for generating the image classification network may be applied to a terminal device, or may be applied to a server, or may be applied to an implementation environment formed by the terminal device and the server. In addition, the method for generating the image classification network may be software running in the terminal device or the server, such as an application program with a network training function, etc.
The terminal device may be a mobile device, a user terminal, a vehicle-mounted device, a computing device, a wearable device, or the like. The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. In some possible implementations, the method of generating the image classification network may be implemented by way of a processor invoking computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a method for generating an image classification network according to an embodiment of the present disclosure includes the following steps S101 to S106:
S101, in each round of training, acquiring an image sample set of the round of training; the image sample set comprises at least two sample images, and each sample image is marked with a real category label of each sample image.
The method for generating an image classification network in the embodiments of the disclosure is a form of supervised contrastive learning, and therefore each sample image is labeled with a real class label. In one embodiment, the real class label is a vector, e.g., [0,0,0,1,0,0], which may also be called a hard label: one dimension has a value of 1 and the other dimensions have a value of 0.
It will be appreciated that a neural network is usually trained over many rounds, where one complete training pass refers to the process in which the network to be trained learns all of the image samples once. For example, a large number of image samples may be prepared before training and divided into multiple image sample sets; each round of training inputs one image sample set (also called a batch) to the network to be trained, and after all image sample sets have been input, one complete training pass is finished.
For example, if there are 1,281,000 image samples in total and 1024 image samples form one image sample set, they can be divided into about 1251 (1,281,000/1024) image sample sets, i.e., training proceeds in 1251 rounds, and these 1251 rounds correspond to one complete training pass (one pass iterates 1251 rounds). Of course, the numbers in this example are merely illustrative and are only used to explain the training process; it will be understood that in other embodiments, the size of the training set and the number of training iterations may be set according to the actual situation, which is not specifically limited herein.
S102, inputting the image sample set into a first image classification network and a second image classification network respectively, to obtain a first output result output by the first image classification network and a second output result output by the second image classification network in the current round of training, wherein the second image classification network is obtained based on the parameter-adjusted first image classification network from at least one completed round of training.
The first image classification network is used to extract features of the image samples and, based on the extracted features, predict the corresponding probability distributions; the second image classification network likewise extracts features of the image samples and predicts the corresponding probability distributions. "First" and "second" merely distinguish the two networks and their respective outputs. Thus, the first output result includes a first feature of each sample image and a first prediction probability distribution of each sample image, and the second output result includes a second feature of each sample image and a second prediction probability distribution of each sample image. The first prediction probability distribution characterizes the probability that each sample image belongs to each of a plurality of preset categories, and so does the second prediction probability distribution.
Of course, in other embodiments, the first output result and the second output result may only include the features of the image sample or the predictive probability distribution. Specifically, the content included in the output result can be determined according to the actual application situation.
Alternatively, since the second image classification network is similar to the first image classification network in terms of feature extraction and prediction probability distribution, the first image classification network will be described in detail below.
For each sample image input to the first image classification network, the image sample can be encoded by a convolutional encoder of the first image classification network to obtain a first feature; a dot product is then computed between the first feature and the class center features of the first image classification network to obtain a first operation result, and the first operation result is normalized to obtain the first prediction probability distribution of the image sample. Specifically, the first prediction probability distribution may be obtained by a softmax function, and indicates the probability that the image sample belongs to each class center.
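As a minimal sketch of this step (assuming precomputed encoder features in place of the convolutional encoder, and NumPy in place of a deep learning framework), the class-center dot product followed by softmax normalization might look like:

```python
import numpy as np

def softmax(logits):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict_distribution(first_features, class_centers):
    """Dot each sample's first feature with every class-center feature
    (the 'first operation result'), then normalize the scores into a
    prediction probability distribution over the class centers."""
    logits = first_features @ class_centers.T   # (batch, num_classes)
    return softmax(logits)

# Toy example: 2 sample features, 3 class centers, 4-dimensional features.
features = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])
centers = np.random.default_rng(0).normal(size=(3, 4))
probs = predict_distribution(features, centers)
```

Each row of `probs` is one sample's first prediction probability distribution, so each row sums to 1.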
In this embodiment of the present disclosure, after obtaining the second output result, the second output result needs to be cached, so, with iteration of the training round number, the second output result of the historical multi-round training may be obtained, for use in determining the second loss of the second loss function in the subsequent training, and details about the historical second output result will be described in detail later.
Illustratively, the second image classification network being obtained based on the first image classification network means that a preset association exists between the two networks, and the second image classification network is used to assist in training the first image classification network. Optionally, the first image classification network may also be called a student network (an online training network), and the second image classification network a teacher network.
In some embodiments, the second image classification network may be obtained as a moving average of the parameter-adjusted first image classification network from at least one completed round of training. A moving average, i.e., an exponentially weighted average, can be used to estimate the local mean of a variable so that each update of the variable depends on its historical values over a period of time.
Specifically, the moving average can be computed by the following formula (1):

F_t ← m · F_t + (1 − m) · F_s    (1)

where F_t represents the second image classification network, F_s represents the first image classification network, and m is a coefficient that typically takes a value close to 1, such as 0.99999. That is, the second image classification network accumulates the first image classification network through slow iteration, which strengthens the temporal consistency of the historical second output results, keeps the second features in the historical second output results correlated, and thereby improves the effectiveness of the second loss function.
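A sketch of this update, under the assumption that each network can be represented as a dictionary of NumPy parameter arrays:

```python
import numpy as np

def ema_update(teacher, student, m=0.99999):
    """Formula (1): F_t <- m * F_t + (1 - m) * F_s, applied
    parameter-by-parameter, so the teacher accumulates the student
    slowly and its outputs stay temporally consistent."""
    return {name: m * teacher[name] + (1.0 - m) * student[name]
            for name in teacher}

# Toy parameters; m=0.9 exaggerates the update so the effect is visible.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, m=0.9)  # w becomes 0.9*0 + 0.1*1
```

With the value m = 0.99999 mentioned in the text, each student update moves the teacher only a tiny step, which is what keeps the cached teacher features mutually consistent across rounds.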
And S103, determining a first loss of a preset first loss function based on a first output result obtained by the round of training and the real class labels corresponding to each sample image.
For example, the first penalty of the first penalty function may be determined based on the first predictive probability distribution for each sample image of the current run and the true class label for each sample image of the current run. That is, the first loss function is used to characterize the distance between the first predictive probability distribution and the true category labels.
Specifically, for each sample image, a first loss may be determined based on the first prediction probability distribution of the image sample and the real class label corresponding to the image sample; after the first loss of each image sample is obtained, the first losses are averaged to obtain the first loss of the first loss function.
Alternatively, the first loss function may be a cross entropy loss function or a relative entropy loss function, which is not limited herein.
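A NumPy sketch of the first loss with a cross-entropy instantiation, using the one-hot hard labels described above (the chosen probabilities are illustrative):

```python
import numpy as np

def first_loss(pred_probs, hard_labels, eps=1e-12):
    """Cross entropy between each first prediction probability
    distribution and the one-hot real class label, averaged over
    the image sample set; eps guards against log(0)."""
    per_sample = -np.sum(hard_labels * np.log(pred_probs + eps), axis=1)
    return per_sample.mean()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
loss = first_loss(probs, labels)   # -(ln 0.7 + ln 0.8) / 2
```

Because the labels are one-hot, only the probability predicted for the true class contributes to each per-sample term.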
S104, determining a second loss of a preset second loss function based on a first output result and a historical second output result of the round of training, wherein the second loss function represents the distance between the first output result and a sample image with similarity larger than a preset threshold value in the historical second output result; the historical second output result comprises a second output result obtained in at least one round of historical training before the round of training.
Illustratively, because the historical second output results are cached from the second output results of at least one previous training round, the historical second output results include a second characteristic of each sample image of at least one previous training round and a second predictive probability distribution of each sample image of at least one previous training round. For example, if the present training is the 5 th training, the historical second output result may be a set of second output results output by at least one of the first 4 training rounds.
It can be understood that if the historical second output result contains too few second features, there may be no same-class samples shared between the current round and the historical rounds, which makes the loss value of the loss function inaccurate and less meaningful as a reference; if it contains too many second features, more memory is occupied and training efficiency suffers. Thus, in some embodiments, before determining the second loss of the preset second loss function based on the first output result of the current round of training and the historical second output result, the method further comprises: reading the historical second output result from a cache, wherein the number of second features contained in the cached historical second output result is a fixed value. Optionally, the fixed value is greater than a first preset number and less than a second preset number. This improves both the classification accuracy and the training efficiency of the first image classification network.
Since the output result of each sample image corresponds to one feature and one prediction probability distribution, the number of second prediction probability distributions in the historical second output result is the same as the number of second features.
Illustratively, in the case where the number of second features contained in the cache is less than the fixed value, the method further comprises: and adding the second output result obtained by the training in the round into the buffer memory.
However, as training rounds iterate, if the second output result of every round were kept, the historical second output result would keep growing and the caching and computation load would increase. In some embodiments, to keep the number of second features in the historical second output result at the fixed value while maintaining timeliness, the method further includes: when the number of second features in the cache is not smaller than the fixed value, deleting the earliest historical second output result in the cache and adding the second output result obtained in the current round, i.e., applying a first-in-first-out policy to the second output results in the cache.
For example, continuing the earlier example in which each round of training uses 1024 image samples, if the fixed value is 65536, the cache reaches 65536 (64 × 1024) features at round 64. At round 65, the second output result of round 65 replaces that of round 1; at round 66, the second output result of round 66 replaces that of round 2; and so on until training ends. In practice, therefore, the method for generating an image classification network according to the embodiments of the present disclosure may only take full effect from the Nth round of the first training pass, where N depends on the number of image samples per round and the preset number of features in the historical second output result.
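A sketch of such a fixed-size first-in-first-out cache (the class name and sizes are hypothetical; `collections.deque` with `maxlen` implements the eviction):

```python
from collections import deque
import numpy as np

class HistoryCache:
    """FIFO cache of historical second features and second prediction
    probability distributions; once `capacity` entries are stored, the
    oldest entries are evicted as new ones are enqueued."""
    def __init__(self, capacity):
        self.features = deque(maxlen=capacity)  # deque drops the oldest item itself
        self.probs = deque(maxlen=capacity)

    def enqueue(self, batch_features, batch_probs):
        for f, p in zip(batch_features, batch_probs):
            self.features.append(f)
            self.probs.append(p)

    def contents(self):
        return np.array(self.features), np.array(self.probs)

# Capacity 4 with batches of 2: after 3 rounds, the round-0 entries are gone.
cache = HistoryCache(capacity=4)
for round_idx in range(3):
    cache.enqueue(np.full((2, 8), float(round_idx)),
                  np.full((2, 3), float(round_idx)))
feats, _ = cache.contents()
```

After the third round the cache holds the two entries from round 1 and the two from round 2, mirroring the round-65-replaces-round-1 behavior described above at toy scale.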
In one embodiment, the determining the second loss of the preset second loss function based on the first output result of the current training and the historical second output result includes: and determining a second loss of the second loss function based on the first characteristic of each sample image of the current round of training and the second characteristic of each sample image acquired in the at least one round of training.
Specifically, for each sample image of the current round, a dot product may be computed between the first feature of the sample image and each second feature obtained in at least one historical round, yielding a second loss for that sample image; the second losses of all image samples in the current round are then averaged to obtain the second loss of the second loss function.
For example, the second loss function may be given by the following formula (2):

L_sup = − (1 / |Pos(i)|) · Σ_{z_p ∈ Pos(i)} log [ exp(z_i · z_p / T_sup) / ( Σ_{z_p' ∈ Pos(i)} exp(z_i · z_p' / T_sup) + Σ_{z_n ∈ Neg(i)} exp(z_i · z_n / T_sup) ) ]    (2)

where L_sup denotes the second loss of the second loss function, log is the logarithmic function, exp is the exponential function with natural base e, z_i is the feature of an image sample, Pos(i) is the set of same-class sample features, Neg(i) is the set of different-class sample features, z_p is a feature in the historical second output result belonging to the same class as the sample image, z_n is a feature in the historical second output result belonging to a different class, and T_sup is a temperature parameter controlling the sharpness of the distribution.
Since the temperature parameter is generally a common parameter, the temperature parameter will not be discussed in detail in this embodiment.
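A NumPy sketch of this loss for a single anchor feature z_i against the cached features. The positive/negative split would come from comparing class labels; here it is supplied as a boolean mask, and the temperature value 0.07 is an assumption (the patent leaves T_sup unspecified):

```python
import numpy as np

def second_loss_single(anchor, cached_feats, same_class_mask, t_sup=0.07):
    """Supervised contrastive loss of one anchor: -log of the softmax
    weight of each same-class cached feature (Pos(i)), averaged over
    Pos(i); the denominator sums over Pos(i) and Neg(i)."""
    sims = cached_feats @ anchor / t_sup     # z_i . z_k / T_sup
    exp_sims = np.exp(sims - sims.max())     # stabilized exponentials
    denom = exp_sims.sum()                   # over Pos(i) and Neg(i)
    pos = exp_sims[same_class_mask]          # Pos(i) terms
    return -np.mean(np.log(pos / denom))

rng = np.random.default_rng(1)
anchor = rng.normal(size=5)
cached = rng.normal(size=(6, 5))
mask = np.array([True, True, False, False, True, False])
loss = second_loss_single(anchor, cached, mask)
```

The batch-level second loss would average this quantity over every sample image of the current round, as described above.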
S105, parameter adjustment is carried out on the first image classification network based on the first loss and the second loss.
For example, after the first loss and the second loss are obtained, the parameters of the first image classification network may be adjusted according to both losses, and the next round of training continues with the parameter-adjusted first image classification network. It will be appreciated that with each round of parameter adjustment, the first image classification network converges in the direction constrained by the loss functions.
And S106, repeating the training for a plurality of times to obtain a first image classification network after the training is completed.
By way of example, as described above, by continuously performing iterative training on the first image classification network, its performance becomes increasingly strong; after repeating the training for multiple rounds, training can be stopped once the training result meets a preset requirement, yielding the trained first image classification network.
A total number of training iterations (e.g., 120,000) or training rounds can be preset, and the training result is deemed to meet the preset requirement when the number of iterations reaches the preset total or the number of rounds reaches the preset total. Optionally, the training result may also be deemed to meet the preset requirement when the loss values of the preset first and second loss functions (i.e., the first loss and the second loss) each reach their corresponding preset thresholds.
Referring to fig. 2, in the method for generating an image classification network according to the embodiments of the present disclosure, the network parameters are adjusted according to both the first loss function and the second loss function, i.e., network training combines the traditional class-center approach with contrastive learning. This not only shortens the distance between same-class image samples and their class centers, but also pulls the most similar same-class samples (such as 11 and 12, or 21 and 22) even closer together, achieving the goal of tightening same-class features and thereby helping to improve the image classification accuracy of the first image classification network.
In order to further enhance the classification capability of the first image classification network, referring to fig. 3, another method for generating an image classification network is provided in an embodiment of the present disclosure, which includes the following steps S301 to S308:
S301, in each round of training, acquiring an image sample set of the round of training; the image sample set comprises at least two sample images, and each sample image is marked with a real category label of each sample image.
This step is similar to the aforementioned step S101, and will not be described again here.
S302, inputting the image sample set into a first image classification network and a second image classification network respectively, to obtain a first output result output by the first image classification network and a second output result output by the second image classification network in the current round of training, wherein the second image classification network is obtained based on the parameter-adjusted first image classification network from at least one completed round of training.
This step is similar to the aforementioned step S102, and will not be described again here.
S303, determining a first loss of a preset first loss function based on a first output result obtained by the training and the real class label corresponding to each sample image.
This step is similar to the aforementioned step S103 and will not be described again here.
S304, determining a second loss of a preset second loss function based on a first output result and a historical second output result of the round of training, wherein the second loss function represents the distance between the first output result and a sample image with similarity larger than a preset threshold value in the historical second output result; the historical second output result comprises a second output result obtained in at least one round of historical training before the round of training.
This step is similar to the aforementioned step S104, and will not be described again here.
S305, generating a virtual category label of the image sample set in the round of training based on the second output result of the round of training and the historical second output result.
Illustratively, when generating the virtual category labels of the image sample set in the present round of training based on the second output result of the present round of training and the historical second output result, the following (a) to (b) may be included:
(a) For each sample image of the current round of training, determine the similarity distribution between the second feature of the image sample and the second features of each sample image in the historical second output result.
(b) Weight the second prediction probability distributions of the sample images in the historical second output result by the similarity distribution to obtain the virtual class label of the image sample of the current round. Specifically, for each sample image of the current round, compute the dot product between the second feature of the image sample and the second features in the historical second output result to obtain a second operation result, normalize it with a softmax function to obtain the similarities, and then use the similarities to weight the second prediction probability distributions in the historical second output result, yielding the virtual class label (also called a soft label) of the image sample.
The virtual class label has a form similar to the real class label, except that it is multi-dimensional data in which each dimension corresponds to a category and the value of each dimension represents the probability of the corresponding category. The dimension with the largest value indicates the most probable category, while the other dimensions hold smaller but nonzero values.
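A sketch of steps (a)-(b): the soft label is a convex combination of the cached second prediction probability distributions, weighted by the softmax-normalized dot-product similarities (all names and sizes here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def virtual_label(second_feature, cached_feats, cached_probs):
    """Step (a): similarity distribution between the sample's second
    feature and each cached second feature; step (b): use it to weight
    the cached second prediction probability distributions."""
    sims = softmax(cached_feats @ second_feature)  # similarity distribution
    return sims @ cached_probs                     # convex combination

rng = np.random.default_rng(2)
feature = rng.normal(size=4)
cached_feats = rng.normal(size=(5, 4))
raw = rng.normal(size=(5, 3))
cached_probs = np.exp(raw) / np.exp(raw).sum(axis=1, keepdims=True)
soft_label = virtual_label(feature, cached_feats, cached_probs)
```

Because the weights sum to 1 and each cached row is itself a probability distribution, the resulting soft label is a valid distribution with no dimension exactly 0, matching the soft-label description above.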
S306, determining a third loss of a preset third loss function based on the difference between the first output result of the round of training and the virtual category label.
In particular, a third loss of the third loss function may be determined based on the virtual class label of each sample image of the current round and the first prediction probability distribution of each sample image. That is, a third loss corresponding to each image sample may be determined from its virtual class label and first prediction probability distribution, and the third losses of all sample images are then averaged to obtain the third loss of the third loss function.
Wherein the third loss function is used to reduce the distance between the virtual class labels and the first predictive probability distribution for each sample image. Optionally, the third loss function is a relative entropy loss function (also called KL divergence), and may also be a cross entropy loss function, which is not limited herein.
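A sketch of the relative-entropy instantiation of the third loss for one sample image, with the virtual class label taken as the target distribution (the KL direction is an assumption; the patent only names the loss):

```python
import numpy as np

def third_loss_single(soft_label, pred_probs, eps=1e-12):
    """KL divergence KL(soft_label || pred_probs): small when the
    first prediction probability distribution matches the virtual
    class label, zero when they are identical."""
    p = soft_label + eps
    q = pred_probs + eps
    return float(np.sum(p * np.log(p / q)))

label = np.array([0.2, 0.5, 0.3])   # virtual class label (soft label)
close = np.array([0.25, 0.45, 0.30])
far = np.array([0.70, 0.10, 0.20])
loss_close = third_loss_single(label, close)
loss_far = third_loss_single(label, far)
```

A prediction closer to the soft label yields a smaller third loss, which is exactly the distance the third loss function is meant to reduce.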
S307, performing parameter adjustment on the first image classification network based on a weighted sum of the first loss, the second loss, and the third loss.
This step differs from the aforementioned step S105 in that the first image classification network is parameter-adjusted based on the first loss, the second loss, and the third loss. Optionally, the parameters may be adjusted based on a weighted sum of the three losses, i.e., the first image classification network is adjusted under the constraint of all three loss functions.
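A sketch of the combined objective of step S307; the equal default weights are an assumption, since the text only specifies "a weighted sum":

```python
def total_loss(first_loss, second_loss, third_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the class-center loss (first), the contrastive
    loss (second), and the distribution-consistency loss (third); the
    gradient of this scalar drives the parameter adjustment."""
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss

# Illustrative loss values and weights.
loss = total_loss(0.29, 1.50, 0.19, weights=(1.0, 0.5, 0.5))
```

In a framework with automatic differentiation, back-propagating this single scalar adjusts the first image classification network under all three constraints at once.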
And S308, repeating the training for a plurality of times to obtain a first image classification network after the training is completed.
This step is similar to the aforementioned step S106, and will not be described again here.
In the embodiments of the disclosure, during the training of the first image classification network, the first prediction probability distributions of same-class image samples are further constrained by the third loss function so that same-class image samples predict similar probability distributions; that is, distribution-consistency regularization makes the cluster of same-class samples tighter, which further helps to improve the classification performance of the first image classification network.
Referring to fig. 4, in some embodiments, there is also provided an image classification method including the following steps S401 to S402:
s401, acquiring an image to be classified.
The image may be an image captured by an image capturing device, or an image frame obtained by decoding video data captured by an image capturing device.
The image to be classified may come from any scene. For example, for an image in an automatic driving scene, classifying the image can determine the environmental objects around the vehicle; for an image in an autonomous mobile robot scene, e.g., an image photographed by a home robot, classification can determine the objects present in the robot's environment.
S402, inputting the images to be classified into an image classification network to obtain classification results; the image classification network is obtained by the method for generating an image classification network according to any of the foregoing embodiments.
The image classification network here is the trained first image classification network of the foregoing embodiments, which improves the accuracy of the classification result for the image to be classified.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Based on the same technical concept, the embodiment of the disclosure further provides a device for generating an image classification network, which corresponds to the method for generating an image classification network, and since the principle of solving the problem of the device in the embodiment of the disclosure is similar to that of the method for generating an image classification network in the embodiment of the disclosure, the implementation of the device can refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 5, a schematic diagram of an apparatus 500 for generating an image classification network according to an embodiment of the disclosure is provided, where the apparatus 500 for generating an image classification network includes:
The sample acquiring module 501 is configured to acquire, in each training round, an image sample set of the training round; the image sample set comprises at least two sample images, and each sample image is marked with a real category label of each sample image;
The result detection module 502 is configured to input the image sample set into a first image classification network and a second image classification network respectively, to obtain a first output result output by the first image classification network and a second output result output by the second image classification network in the present round of training, where the second image classification network is obtained based on the parameter-adjusted first image classification network from at least one completed round of training;
A first determining module 503, configured to determine a first loss of a preset first loss function based on a first output result obtained by the present training and a real class label corresponding to each sample image;
A second determining module 504, configured to determine a second loss of a preset second loss function based on the first output result of the present round of training and the historical second output results, where the second loss function characterizes the distance between the first output result and those sample images in the historical second output results whose similarity is greater than a preset threshold; the historical second output results comprise the second output results obtained in at least one round of historical training before the present round of training;
A parameter adjustment module 505, configured to perform parameter adjustment on the first image classification network based on the first loss and the second loss;
And the iterative training module 506 is configured to repeat the training for a plurality of rounds to obtain the trained first image classification network.
In a possible embodiment, referring to fig. 6, the apparatus further includes a tag generating module 507 and a third determining module 508, where the tag generating module 507 is configured to:
generating a virtual category label of the image sample set in the round of training based on a second output result of the round of training and the historical second output result;
the third determining module 508 is configured to:
Determining a third loss of a preset third loss function based on the difference between the first output result of the round of training and the virtual category label;
the parameter adjustment module 505 is specifically configured to:
parameter adjustments are made to the first image classification network based on a weighted sum of the first, second, and third losses.
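The weighted combination above can be sketched in a few lines. The weight values below are illustrative assumptions, since the disclosure does not fix them; in practice they would be tuned as hyperparameters.

```python
def total_loss(first_loss, second_loss, third_loss, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three losses used to adjust the first network.

    The default weights are placeholders for illustration only."""
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```

The resulting scalar would then drive an ordinary gradient-based parameter update of the first image classification network.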
In a possible embodiment, the second image classification network is obtained by taking a moving average of the parameter-adjusted first image classification network over at least one completed round of training.
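As one possible reading of this moving-average update, each parameter of the second network is blended with the corresponding freshly adjusted parameter of the first network. The momentum value below is an assumed typical choice, not one fixed by the disclosure.

```python
def ema_update(second_params, first_params, momentum=0.99):
    """Moving-average update of the second network's parameters.

    Each parameter becomes a momentum-weighted blend of its old value
    and the first network's value after the latest parameter adjustment.
    momentum=0.99 is an illustrative assumption."""
    return {name: momentum * second_params[name]
                  + (1.0 - momentum) * first_params[name]
            for name in second_params}
```

Applied once per round, this keeps the second network a smoothed, slowly moving copy of the first network.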
In a possible implementation manner, the first output result of the present round of training includes a first feature of each sample image and a first prediction probability distribution of each sample image, where the first prediction probability distribution characterizes the distribution of probabilities that each sample image belongs to each of a preset plurality of categories; the first determining module 503 is specifically configured to:
and determining the first loss of the first loss function based on the first prediction probability distribution of each sample image of the current round of training and the real class label corresponding to each sample image of the current round of training.
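A minimal sketch of such a first loss, assuming standard cross-entropy between the first prediction probability distributions and the real category labels; the disclosure only specifies a "preset first loss function", so cross-entropy is an assumed concrete choice.

```python
import numpy as np

def first_loss(pred_probs, true_labels):
    """Cross-entropy between predicted probability distributions (N, C)
    and integer real category labels (N,). A small epsilon guards log(0)."""
    n = pred_probs.shape[0]
    picked = pred_probs[np.arange(n), true_labels]  # probability of the true class
    return float(-np.mean(np.log(picked + 1e-12)))
```

A perfect prediction yields a loss near zero; a uniform two-class prediction yields about ln 2.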
In a possible implementation manner, the historical second output results include a second feature of each sample image acquired in at least one round of historical training; the second determining module 504 is specifically configured to:
determine a second loss of the second loss function based on the first feature of each sample image of the present round of training and the second feature of each sample image acquired in the at least one round of historical training.
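One way to realize this second loss, as a sketch: cosine similarity selects the historical second features exceeding the preset threshold, and the loss averages the squared distance between the current first feature and each selected historical feature. Both the similarity measure and the squared-distance form are assumptions; the disclosure only requires that the loss characterize the distance to sufficiently similar historical samples.

```python
import numpy as np

def second_loss(first_feats, hist_feats, threshold=0.5):
    """Average squared distance from each current first feature (N, D)
    to the historical second features (M, D) whose cosine similarity
    to it exceeds the threshold."""
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    sim = unit(first_feats) @ unit(hist_feats).T   # (N, M) cosine similarities
    loss, count = 0.0, 0
    for i in range(first_feats.shape[0]):
        for j in np.nonzero(sim[i] > threshold)[0]:
            loss += np.sum((first_feats[i] - hist_feats[j]) ** 2)
            count += 1
    return loss / max(count, 1)
```

Minimizing this pulls the first network's features toward the cached features of samples the second network already judged similar.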
In a possible implementation manner, the first output result of the present round of training includes a first prediction probability distribution of each sample image, the second output result of the present round of training includes a second feature of each sample image, and the historical second output results include a second feature of each sample image output by the second image classification network in at least one round of historical training and a second prediction probability distribution of each sample image, where the second prediction probability distribution characterizes the distribution of probabilities that each sample image belongs to each of a preset plurality of categories; the tag generation module 507 is specifically configured to:
for each sample image of the present round of training, determine a similarity distribution of the second feature of that sample image relative to the second features of the sample images in the historical second output results;
weight the second prediction probability distributions of the sample images in the historical second output results based on the similarity distribution, to obtain a virtual category label for that sample image of the present round of training;
the third determining module 508 is specifically configured to:
And determining a third loss of the third loss function based on the virtual category label of each sample image of the present round of training and the first prediction probability distribution of each sample image of the present round of training.
In a possible implementation, the apparatus further comprises a result access module 509, the result access module 509 being configured to:
and reading the historical second output result from the cache, wherein the number of the second features contained in the historical second output result in the cache is a fixed value.
In a possible implementation, the result access module 509 is further configured to:
adding the second output result obtained in the present round of training into the cache when the number of second features contained in the cache is smaller than the fixed value; or
deleting the historical second output result that entered the cache earliest, and adding the second output result obtained in the present round of training into the cache, when the number of second features contained in the cache is not smaller than the fixed value.
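The fixed-capacity, oldest-first eviction behavior described above matches a simple FIFO buffer; a minimal sketch:

```python
from collections import deque

class FeatureCache:
    """Fixed-capacity cache of historical second output results.

    When the cache is full, the entry that entered earliest is dropped
    before a new entry is added (first in, first out)."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest element automatically on append.
        self.buf = deque(maxlen=capacity)

    def add(self, second_output):
        self.buf.append(second_output)

    def read(self):
        return list(self.buf)
```

With capacity equal to the fixed value of second features, appending each round's second output result reproduces the add-or-evict logic above in one call.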
Referring to fig. 7, an embodiment of the present disclosure provides an image classification apparatus 700, including:
an image acquisition module 701, configured to acquire an image to be classified;
The image classification module 702 is configured to input the image to be classified into an image classification network to obtain a classification result; the image classification network is obtained by the method of any of the preceding embodiments.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. Referring to fig. 8, which is a schematic structural diagram of an electronic device 800 according to an embodiment of the present disclosure, the electronic device 800 includes a processor 801, a memory 802, and a bus 803. The memory 802 is configured to store execution instructions and includes an internal memory 8021 and an external memory 8022; the internal memory 8021 is used to temporarily store operation data in the processor 801 and data exchanged with the external memory 8022 such as a hard disk, and the processor 801 exchanges data with the external memory 8022 through the internal memory 8021.
In the embodiments of the present disclosure, the memory 802 is specifically configured to store the application program code for executing the solutions of the present disclosure, and execution is controlled by the processor 801. That is, when the electronic device 800 operates, the processor 801 communicates with the memory 802 through the bus 803, causing the processor 801 to execute the application program code stored in the memory 802, thereby performing the method described in any of the foregoing embodiments.
The memory 802 may be, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), etc.
The processor 801 may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component. The methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure may be implemented or executed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should be understood that the illustrated structure in this embodiment of the present disclosure does not constitute a specific limitation on the electronic device 800. In other embodiments of the present disclosure, the electronic device 800 may include more or fewer components than illustrated, some components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of generating an image classification network in the method embodiments described above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiments of the present disclosure further provide a computer program product carrying program code, where the instructions included in the program code may be used to perform the steps of the method for generating an image classification network in the foregoing method embodiments; for details, reference may be made to the foregoing method embodiments, which are not repeated here.
The above computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems and apparatuses described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or a part of the technical solution, or in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a mobile hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure and shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method of generating an image classification network, comprising:
In each round of training, acquiring an image sample set of the round of training; the image sample set comprises at least two sample images, and each sample image is marked with a real category label of each sample image;
Respectively inputting the image sample set into a first image classification network and a second image classification network, to obtain a first output result output by the first image classification network and a second output result output by the second image classification network in the present round of training, wherein the second image classification network is obtained based on the parameter-adjusted first image classification network from at least one completed round of training;
Determining a first loss of a preset first loss function based on a first output result obtained by the training and a real class label corresponding to each sample image;
Determining a second loss of a preset second loss function based on the first output result of the present round of training and historical second output results, wherein the second loss function characterizes the distance between the first output result and those sample images in the historical second output results whose similarity is greater than a preset threshold; the historical second output results comprise the second output results obtained in at least one round of historical training before the present round of training;
Performing parameter adjustment on the first image classification network based on the first loss and the second loss;
repeating the training for a plurality of times to obtain a first image classification network with the training completed.
2. The method of claim 1, wherein prior to the parameter adjustment of the first image classification network based on the first loss and the second loss, the method further comprises:
generating a virtual category label of the image sample set in the round of training based on a second output result of the round of training and the historical second output result;
Determining a third loss of a preset third loss function based on the difference between the first output result of the round of training and the virtual category label;
The parameter adjustment of the first image classification network based on the first loss and the second loss includes:
parameter adjustments are made to the first image classification network based on a weighted sum of the first, second, and third losses.
3. The method according to claim 1 or 2, wherein the second image classification network is obtained by taking a moving average of the parameter-adjusted first image classification network over at least one completed round of training.
4. The method of claim 1, wherein the first output of the present run includes a first feature of each sample image of the present run and a first predictive probability distribution of the each sample image of the present run, the first predictive probability distribution characterizing a distribution of probabilities that the each sample image belongs to each of a preset plurality of categories;
The determining the first loss of the preset first loss function based on the first output result of the present training and the real class label corresponding to each sample image comprises the following steps:
and determining the first loss of the first loss function based on the first prediction probability distribution of each sample image of the current round of training and the real class label corresponding to each sample image of the current round of training.
5. The method of claim 4, wherein the historical second output comprises a second characteristic of each sample image acquired in at least one training round of the history;
the determining the second loss of the preset second loss function based on the first output result and the historical second output result of the present training includes:
and determining a second loss of the second loss function based on the first characteristic of each sample image of the current round of training and the second characteristic of each sample image acquired in the at least one round of training.
6. The method of claim 2, wherein the first output result of the present round of training comprises a first prediction probability distribution of each sample image of the present round, the second output result of the present round of training comprises a second feature of each sample image of the present round, and the historical second output results comprise a second feature of each sample image output by the second image classification network in at least one round of historical training and a second prediction probability distribution of each sample image, the second prediction probability distribution characterizing the distribution of probabilities that each sample image belongs to each of a preset plurality of categories;
The generating a virtual category label of the image sample set in the round of training based on the second output result of the round of training and the historical second output result comprises the following steps:
for each sample image of the present round of training, determining a similarity distribution between second features of the image samples of the present round of training relative to second features of each sample image in the historical second output result;
Weighting the second probability distribution of each sample image in the history second output result based on the similarity distribution to obtain a virtual category label of the image sample of the training round;
The determining a third loss of a preset third loss function based on the difference between the first output result of the present round of training and the virtual category label comprises the following steps:
and determining a third loss of the third loss function based on the virtual class label of each sample image of the present training and the first prediction probability distribution of each sample image of the present training.
7. The method of claim 5 or 6, wherein prior to determining the second penalty of the preset second penalty function based on the first output result of the present run of training and the historical second output result, the method further comprises:
and reading the historical second output result from the cache, wherein the number of the second features contained in the historical second output result in the cache is a fixed value.
8. The method according to claim 7, further comprising: adding the second output result obtained in the present round of training into the cache when the number of second features contained in the cache is smaller than the fixed value; or
deleting the historical second output result that entered the cache earliest, and adding the second output result obtained in the present round of training into the cache, when the number of second features contained in the cache is not smaller than the fixed value.
9. An image classification method, comprising:
Acquiring an image to be classified;
Inputting the images to be classified into an image classification network to obtain classification results; the image classification network is obtained by the method of any one of claims 1-8.
10. An apparatus for generating an image classification network, the apparatus comprising:
the sample acquisition module is used for acquiring an image sample set of the training in each round of training; the image sample set comprises at least two sample images, and each sample image is marked with a real category label of each sample image;
the result detection module is used for respectively inputting the image sample set into a first image classification network and a second image classification network to obtain a first output result output by the first image classification network and a second output result output by the second image classification network in the round of training, wherein the second image classification network is obtained based on the first image classification network after the adjustment parameters are obtained in at least one round of training;
The first determining module is used for determining a first loss of a preset first loss function based on a first output result obtained by the round of training and a real class label corresponding to each sample image;
the second determining module is used for determining second loss of a preset second loss function based on the first output result and the historical second output result of the round of training, wherein the second loss function represents the distance between the first output result and a sample image with similarity larger than a preset threshold value in the historical second output result; the historical second output result comprises a second output result obtained in at least one round of historical training before the round of training;
The parameter adjustment module is used for performing parameter adjustment on the first image classification network based on the first loss and the second loss;
and the iterative training module is used for repeating the training for a plurality of times to obtain a first image classification network after the training is completed.
11. An image classification apparatus, the apparatus comprising:
The image acquisition module is used for acquiring images to be classified;
The image classification module is used for inputting the images to be classified into an image classification network to obtain classification results; the image classification network is obtained by the method of any one of claims 1-8.
12. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of generating an image classification network according to any of claims 1-8 and/or the steps of the image classification method according to claim 9.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when run by a processor, performs the steps of the method of generating an image classification network according to any of claims 1-8 and/or the steps of the image classification method according to claim 9.
CN202210765740.9A 2022-07-01 2022-07-01 Image classification network generation method, device, equipment and medium Active CN115049880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210765740.9A CN115049880B (en) 2022-07-01 2022-07-01 Image classification network generation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210765740.9A CN115049880B (en) 2022-07-01 2022-07-01 Image classification network generation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115049880A CN115049880A (en) 2022-09-13
CN115049880B true CN115049880B (en) 2024-10-29

Family

ID=83165555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210765740.9A Active CN115049880B (en) 2022-07-01 2022-07-01 Image classification network generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115049880B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797893A (en) * 2020-05-26 2020-10-20 华为技术有限公司 Neural network training method, image classification system and related equipment
CN113011387A (en) * 2021-04-20 2021-06-22 上海商汤科技开发有限公司 Network training and human face living body detection method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659582A (en) * 2019-08-29 2020-01-07 深圳云天励飞技术有限公司 Image conversion model training method, heterogeneous face recognition method, device and equipment
CN114429579A (en) * 2022-01-29 2022-05-03 上海商汤临港智能科技有限公司 Image processing method, image processing device, image classification method, image classification device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797893A (en) * 2020-05-26 2020-10-20 华为技术有限公司 Neural network training method, image classification system and related equipment
CN113011387A (en) * 2021-04-20 2021-06-22 上海商汤科技开发有限公司 Network training and human face living body detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115049880A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN112084331B (en) Text processing and model training method and device, computer equipment and storage medium
CN110532884B (en) Pedestrian re-recognition method, device and computer readable storage medium
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN113011387B (en) Network training and human face living body detection method, device, equipment and storage medium
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN114926835A (en) Text generation method and device, and model training method and device
CN114663798B (en) Single-step video content identification method based on reinforcement learning
CN116910307A (en) Cross-modal video text retrieval method, system, equipment and medium
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN114090401B (en) Method and device for processing user behavior sequence
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN115620304A (en) Training method of text recognition model, text recognition method and related device
CN114494809A (en) Feature extraction model optimization method and device and electronic equipment
CN113159053A (en) Image recognition method and device and computing equipment
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
CN115049880B (en) Image classification network generation method, device, equipment and medium
CN117786058A (en) Method for constructing multi-mode large model knowledge migration framework
CN118152594A (en) News detection method, device and equipment containing misleading information
CN114741487B (en) Image-text retrieval method and system based on image-text semantic embedding
CN116975347A (en) Image generation model training method and related device
CN113569094A (en) Video recommendation method and device, electronic equipment and storage medium
CN113822291A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant