
CN113762304A - Image processing method, image processing device and electronic equipment - Google Patents

Image processing method, image processing device and electronic equipment Download PDF

Info

Publication number
CN113762304A
Authority
CN
China
Prior art keywords
network
training data
branch
incremental learning
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011351182.9A
Other languages
Chinese (zh)
Other versions
CN113762304B (en)
Inventor
刘浩
徐卓然
董博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Qianshi Technology Co Ltd
Original Assignee
Beijing Jingdong Qianshi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Qianshi Technology Co Ltd filed Critical Beijing Jingdong Qianshi Technology Co Ltd
Priority to CN202011351182.9A priority Critical patent/CN113762304B/en
Publication of CN113762304A publication Critical patent/CN113762304A/en
Application granted granted Critical
Publication of CN113762304B publication Critical patent/CN113762304B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image processing method, an image processing apparatus, and an electronic device. The image processing method comprises: acquiring an input image; and processing the input image with an incremental learning network to determine an image recognition result. The incremental learning network comprises a backbone network and at least two branch networks, wherein each of the at least two branch networks corresponds to a different specified category, and the backbone network together with each branch network forms a classification network for one specified category. The output of the backbone network serves as the input of each of the branch networks, and the branch network is the minimum increment unit of the incremental learning network.

Description

Image processing method, image processing device and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method, an image processing apparatus, and an electronic device.
Background
Deep learning techniques have made tremendous advances in areas such as object classification, text processing, recommendation engines, image search, face recognition, age and speech recognition, human-computer conversation, and affective computing.
In implementing the disclosed concept, the inventors found at least the following problem in the related art: when a network needs to learn a new category, it is difficult to balance the plasticity, stability, and performance of the network.
Disclosure of Invention
In view of the above, the present disclosure provides an image processing method, an image processing apparatus, and an electronic device that can balance network plasticity, stability, and performance.
One aspect of the present disclosure provides an image processing method, including: acquiring an input image; and processing the input image with an incremental learning network to determine an image recognition result. The incremental learning network comprises a backbone network and at least two branch networks, wherein each branch network corresponds to a different specified category, the backbone network together with each branch network forms a classification network for one specified category, the output of the backbone network serves as the input of each branch network, and the branch network is the minimum increment unit of the incremental learning network.
According to an embodiment of the present disclosure, the backbone network includes at least one sequentially connected backbone module, the backbone module includes a convolutional layer and at least one of the following layers: a transform reconstruction layer, an activation function layer, and a pooling layer.
According to an embodiment of the present disclosure, the branch network includes a latent convolutional layer, a global average pooling layer, and a full convolutional layer connected in sequence.
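The global average pooling step in such a branch network can be illustrated numerically. This is a sketch only; the shapes and values are invented for illustration and are not taken from the disclosure:

```python
import numpy as np

# Global average pooling: each feature map (channel) collapses to a single
# value, regardless of its spatial size, before the final classification layer.
feature_maps = np.arange(8 * 7 * 7, dtype=float).reshape(8, 7, 7)  # 8 channels, 7x7 each
gap = feature_maps.mean(axis=(1, 2))                               # one value per channel
```

Because the pooled vector's length depends only on the number of channels, a branch head built this way can accept backbone features of any spatial resolution.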
According to an embodiment of the present disclosure, the output of the branch network is a two-class (binary) classification result.
According to an embodiment of the present disclosure, the incremental learning network is trained as follows: for the branch network of a specified category, training data of the specified category are taken as positive samples and training data outside the specified category as negative samples; the incremental learning network is then trained with the positive and/or negative samples.
According to an embodiment of the present disclosure, training the incremental learning network with positive and/or negative samples comprises: if the category of the training data differs from all existing categories of the historical training data, adding a branch network for the new category and training the added branch network with at least part of the training data as positive samples and at least part of the historical training data as negative samples; and if the category of the training data belongs to an existing category of the historical training data, training the branch network corresponding to that category with at least part of the training data together with at least part of the same-category historical training data as positive samples, and with at least part of the historical training data of other categories as negative samples.
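The two training cases above can be sketched as follows. This is a hedged sketch: the function and data structures are assumptions for illustration, not the patent's implementation.

```python
def split_samples(category, new_data, history):
    """Decide positives/negatives for one training round.

    history maps each existing category to its retained samples.
    Returns (positives, negatives, needs_new_branch).
    """
    if category not in history:
        # New category: a branch would be added for it; the new data are
        # positives and all historical data (other categories) are negatives.
        positives = list(new_data)
        negatives = [x for samples in history.values() for x in samples]
        return positives, negatives, True
    # Existing category: retrain its branch; new data plus same-category
    # historical data are positives, other categories are negatives.
    positives = list(new_data) + list(history[category])
    negatives = [x for c, samples in history.items() if c != category
                 for x in samples]
    return positives, negatives, False

history = {"cat": ["cat_1", "cat_2"], "dog": ["dog_1"]}
pos, neg, added = split_samples("car", ["car_1"], history)
```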
According to an embodiment of the present disclosure, training the incremental learning network with positive and/or negative samples comprises: if the category of the training data differs from the existing categories of the historical training data, fine-tuning the existing branch networks in the incremental learning network with at least part of the training data as negative samples; and if the category of the training data belongs to an existing category of the historical training data, extracting positive and/or negative samples from the training data and the historical training data and fine-tuning the existing branch networks in the incremental learning network.
According to an embodiment of the present disclosure, the image processing method further includes: if the backbone network has not been trained with training data of the specified category, unlocking the network parameters of the backbone network, and otherwise locking them; and/or, if the number of branch networks attached to the backbone network is less than a preset threshold, unlocking the network parameters of the backbone network, and otherwise locking them.
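The locking rule above can be expressed as a small predicate. Note that this sketch combines the two "and/or" conditions in one assumed way; the disclosure leaves the combination open, and the function name is invented:

```python
def backbone_locked(trained_on_category, n_branches, threshold):
    # Lock the backbone's parameters once it has already been trained with
    # this category's data, or once the branch count reaches the preset
    # threshold; otherwise leave them unlocked (trainable).
    return trained_on_category or n_branches >= threshold
```

Early in the network's life the backbone keeps adapting; once it has seen enough categories, locking it protects the general features that all branches share.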
According to an embodiment of the present disclosure, the image processing method further includes: determining representative training data based on at least one of the training data, historical training data of the same category as the training data, and historical training data of a different category from the training data; and constructing a sample library based on the representative training data. Training the incremental learning network with the positive and/or negative samples then comprises training it with the positive and/or negative samples in the sample library.
According to an embodiment of the present disclosure, the total amount of data in the sample library is related to the hardware performance of the electronic device used to train the incremental learning network.
According to an embodiment of the present disclosure, processing the input image with the incremental learning network to determine the image recognition result includes: obtaining, for each of the at least two branch networks, the confidence of its processing result for the input image; concatenating the confidences in the fixed order of the branch networks; and taking the category of the branch network at the position of the highest confidence as the output of the incremental learning network.
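A minimal sketch of this inference rule follows; the branch names and confidence values are invented for illustration:

```python
import numpy as np

# Each branch head emits one confidence for its own category; the scores
# are concatenated in the fixed branch order, and the arg-max position
# selects the predicted category.
branch_order = ["cat", "dog", "car"]
branch_confidences = {"cat": 0.12, "dog": 0.81, "car": 0.35}

scores = np.array([branch_confidences[name] for name in branch_order])
predicted = branch_order[int(np.argmax(scores))]
```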
According to an embodiment of the present disclosure, the input image is an image for an automatic driving task.
Another aspect of the present disclosure provides an image processing apparatus including: an image acquisition module for acquiring an input image; and an image processing module for processing the input image with an incremental learning network to determine an image recognition result. The incremental learning network comprises a backbone network and at least two branch networks, wherein each branch network corresponds to a different specified category, the backbone network together with each branch network forms a classification network for one specified category, the output of the backbone network serves as the input of each branch network, and the branch network is the minimum increment unit of the incremental learning network.
Another aspect of the present disclosure provides an electronic device comprising one or more processors and a memory, wherein the memory is configured to store executable instructions that, when executed by the one or more processors, implement the method described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a convolutional neural network and its operation process in the related art;
FIG. 2 is a schematic structural diagram of a tree convolutional neural network in the related art;
FIG. 3 is an exemplary system architecture to which the image processing method, the image processing apparatus, and the electronic device may be applied according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an incremental learning network according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a backbone network according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a branch network according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a branch network according to another embodiment of the present disclosure;
FIG. 8 is a flowchart of a training method for an incremental learning network according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of model training of a branch network according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of model training of a branch network according to another embodiment of the present disclosure;
FIG. 11 is a flowchart of an image processing method according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a network inference process according to an embodiment of the present disclosure;
FIG. 13 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure; and
FIG. 14 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
The related art may use a network including convolutional layers for feature extraction, classification, and the like, and in order to facilitate understanding of the embodiments of the present disclosure, a convolutional neural network and its operation will be first exemplified.
Fig. 1 is a schematic diagram of a convolutional neural network and its operation process in the related art.
As shown in the upper part of FIG. 1, after an input image is fed into the convolutional neural network through the input layer, a category identifier is output after several processing stages performed in sequence. The main components of a convolutional neural network may include multiple convolutional layers, multiple downsampling layers, and a fully connected layer. For example, a complete convolutional neural network may be composed of a stack of these three kinds of layers. The convolutional neural network shown in the upper part of FIG. 1 includes a first level, a second level, a third level, and so on. For example, each level may include one convolutional layer and one downsampling layer, so the processing at each level may include convolving (convolution) and downsampling (sub-sampling) the input image.
Convolutional layers are the core layers of a convolutional neural network. In a convolutional layer, each neuron is connected to only part of the neurons of the adjacent layer. A convolutional layer may apply several convolution kernels, also called filters, to the input image to extract various types of features; each kernel extracts one type of feature. A convolution kernel is generally initialized as a matrix of small random values, and reasonable weights for it are learned during the training of the network. As shown in the lower part of FIG. 1, the result obtained by applying one convolution kernel to the input image is called a feature map, and the number of feature maps equals the number of convolution kernels. Each feature map consists of neurons arranged in a rectangle, and the neurons of the same feature map share a weight, namely the convolution kernel. The feature maps output by the convolutional layer of one level may be input to the adjacent convolutional layer of the next level and processed again to obtain new feature maps.
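The kernel-to-feature-map relationship can be illustrated with a naive convolution. This is a sketch only; real networks use optimized library kernels, and the example image and kernels are invented:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((5, 5))
kernels = [np.ones((3, 3)), np.eye(3)]   # in practice, randomly initialized then learned
feature_maps = [conv2d_valid(image, k) for k in kernels]
```

Applying two kernels produces two feature maps, matching the text: the number of feature maps equals the number of kernels.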
For example, the convolutional layer may use different convolution kernels to convolve the data of a local receptive field of the input image; the convolution result is input to an activation layer, which computes the output according to the corresponding activation function to obtain feature information of the input image.
A downsampling layer is disposed between adjacent convolutional layers and is a form of downsampling. On one hand, the downsampling layer reduces the scale of the input image, simplifies the computation, and reduces overfitting to a certain extent; on the other hand, it performs feature compression to extract the main features of the input image. A downsampling layer reduces the size of the feature maps without changing their number. For example, downsampling a 12 × 12 input image with a 6 × 6 kernel (stride 6) yields a 2 × 2 output, meaning that every 36 pixels of the input are merged into 1 pixel of the output. The last downsampling or convolutional layer may be connected to one or more fully connected layers, which connect all the extracted features. The output of a fully connected layer is a one-dimensional matrix, i.e., a vector.
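The 12 × 12 → 2 × 2 arithmetic above can be checked directly. This sketch assumes average pooling, though max pooling is equally common:

```python
import numpy as np

# Non-overlapping 6x6 average pooling over a 12x12 input: each 36-pixel
# block of the input is merged into a single output pixel.
image = np.arange(144, dtype=float).reshape(12, 12)

k = 6
pooled = image.reshape(12 // k, k, 12 // k, k).mean(axis=(1, 3))
```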
Given an input data distribution D1, a deep learning model such as the convolutional neural network shown in FIG. 1 learns D1 by adjusting the weights of its neurons. When a new data set with distribution D2 arrives and the model is expected to recognize both data sets, the most intuitive approach is to train on the union of the two data sets so that the model learns the joint distribution of D1 and D2. In practice, however, D1 is often no longer available when D2 is being learned; the weights learned for D1 are modified during the learning of D2, degrading the model's performance on D1. This problem is known as the plasticity and stability dilemma.
To implement incremental training under the deep learning framework, academia has made some attempts, such as Incremental Classifier and Representation Learning (iCaRL), which typically retains a small number of historical samples and transfers the knowledge of the old model to the new model by knowledge distillation. However, the network structure of these methods is usually fixed, i.e., it does not change as the number of learned categories grows. As a result, the recognition accuracy of the model drops significantly as categories are added, which is exactly the opposite of human learning ability. The reason is that when the model is updated, all neurons participate in the update; even if distillation keeps training from being biased toward the new data, the learned weights are inevitably changed by it, and as this process iterates, the model's ability to recognize old data keeps degrading.
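The distillation idea mentioned above can be sketched as follows. The logits and temperature are invented, and this is the generic distillation loss, not the patent's or iCaRL's exact formulation:

```python
import numpy as np

def softmax(z, temperature=1.0):
    e = np.exp((z - z.max()) / temperature)
    return e / e.sum()

# The old (teacher) model's softened outputs act as soft targets for the
# new (student) model, discouraging it from forgetting old categories.
old_logits = np.array([2.0, 0.5, 0.1])
new_logits = np.array([1.8, 0.7, 0.2])
temperature = 2.0

p_old = softmax(old_logits, temperature)
p_new = softmax(new_logits, temperature)
distill_loss = -np.sum(p_old * np.log(p_new + 1e-12))  # cross-entropy to soft targets
```

Minimizing this term alongside the usual classification loss pulls the new model's outputs toward the old model's behavior on old categories.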
For example, in tasks such as automatic driving, models such as VGG (Visual Geometry Group network), residual networks (ResNet), and lightweight networks (MobileNet) may be used. However, these models learn in batch mode (batch learning): if a new category needs to be learned, the whole model must be retrained, which consumes a great deal of computing power and time and leads to long iteration cycles and high development costs. In contrast, incremental learning aims to give a model the ability to learn continuously, i.e., to learn new categories while maintaining the ability to recognize existing ones.
Fig. 2 is a schematic structural diagram of a tree convolutional neural network in the related art.
Some pioneering work has explored dynamic network structures to solve this problem, such as the tree convolutional neural network (TreeCNN). As shown in FIG. 2, these methods usually extend the network with a tree structure: the network starts from a super-class as the root node and then deepens its structure according to specific extension rules. This approach has two non-negligible problems.
On the one hand, the input order of the categories cannot be controlled. Suppose only one category, C1, exists in the current network and a new category C2 is input; the network will extend a branch for C2 from C1. By analogy, suppose there are five categories in total, C1, C2, C3, C4, and C5. To finally distinguish C1 from C5, the network structure may have been extended very deep, even though the model does not actually need such a deep structure to distinguish the two categories; this is clearly a problem caused by the design of the algorithm. As shown in FIG. 2, the newly added leaf nodes are for the categories sheep and bird; to distinguish cat from bird, the input must pass sequentially through the leaf nodes for cat, dog, horse, sheep, and bird, which wastes computing resources. In scenarios such as automatic driving, where hardware computing power is limited, this approach is difficult to use in practice.
On the other hand, a leaf node only sees its own corresponding category and the categories of the few other leaf nodes that participated in training at the same time; this lack of data coverage makes the leaf node prone to misrecognition when it encounters unseen categories.
To overcome these drawbacks of fixed network structures in the related art, the incremental learning network provided by the embodiments of the present disclosure adopts a new dynamic network structure. When learning a new category, the network branches corresponding to existing categories are preserved (if not updated or fine-tuned), so the recognition ability for existing categories is maintained; the network learns new category data by adding new branches. Unlike the depth-growth mode of the tree convolutional neural network in the related art, the incremental learning network grows by a width-expansion strategy, so that each branch can be computed in parallel during network inference, avoiding the performance loss caused by depth growth. In addition, the embodiments of the present disclosure further provide a strategy for iteratively updating existing categories, so that existing branch networks can correctly handle new categories and recognition accuracy is improved.
Fig. 3 is an exemplary system architecture to which the image processing method, the image processing apparatus, and the electronic device may be applied according to an embodiment of the present disclosure. It should be noted that fig. 3 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 3, a system architecture 300 according to this embodiment may include terminal devices 301, 302, 303, a network 304, and a server 305. The network 304 serves as a medium for providing communication links between the terminal devices 301, 302, 303 and the server 305. Network 304 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal device 301, 302, 303 to interact with the server 305 via the network 304 to receive or send messages or the like. The terminal devices 301, 302, 303 may have installed thereon various communication client applications, such as navigation-type applications, image processing-type applications, shopping-type applications, web browser applications, search-type applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 301, 302, 303 may be various electronic devices with image processing capabilities, including but not limited to vehicles, smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 305 may be a server that provides various services, such as a background management server for the incremental learning network requested by the terminal devices 301, 302, 303 (for example only). The background management server may analyze and otherwise process the received data such as the request, and feed back a processing result (for example, topology information, model parameter information, image recognition result, and the like of the incremental learning network) to the terminal device.
It should be noted that the incremental learning network provided by the embodiment of the present disclosure may be applied to a terminal device or a server, and the training method and the image processing method provided by the embodiment of the present disclosure may be executed by the terminal device 301, 302, 303 or the server 305. The training method and the image processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 305 and is capable of communicating with the terminal devices 301, 302, 303 and/or the server 305.
It should be understood that the number of terminal devices, networks, and servers are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
An aspect of the present disclosure provides an image processing method.
The image processing method may include the following operations.
First, an input image is acquired. The input image is then processed using an incremental learning network to determine an image recognition result. Wherein, the incremental learning network can include: a backbone network and at least two branch networks.
The incremental learning network and the training method thereof are respectively illustrated in the following with reference to fig. 4 to 10.
Fig. 4 is a schematic structural diagram of an incremental learning network according to an embodiment of the present disclosure.
As shown in FIG. 4, the incremental learning network may include a backbone network and at least two branch networks. Each of the at least two branch networks corresponds to a different specified category, and the backbone network together with each branch network forms a classification network for one specified category. The output of the backbone network serves as the input of each of the branch networks, and the branch network is the minimum increment unit of the incremental learning network. This network topology can balance plasticity, stability, and performance.
As shown in FIG. 4, the backbone network is connected to branch networks for the categories "cat", "dog", and "car", respectively. When information of a new category needs to be recognized, a new branch network can be set up for that category. Because the branch networks are peers with no ordering among them, the branch networks corresponding to existing categories can be preserved while a new category is learned, maintaining the recognition ability for existing categories; the network learns the new category's data by adding a new branch.
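The branch-per-category structure can be sketched in a few lines. This is a hedged sketch with toy callables standing in for real sub-networks; all names are assumptions:

```python
class IncrementalLearningNet:
    """Width-growing network: one shared backbone, one head per category."""

    def __init__(self, backbone):
        self.backbone = backbone      # shared feature extractor
        self.branches = {}            # category name -> branch head

    def add_branch(self, category, head):
        # Learning a new category only appends a peer branch; existing
        # branches (and their weights) are left untouched.
        self.branches[category] = head

    def forward(self, image):
        features = self.backbone(image)   # general features, computed once
        # All branches consume the same backbone output and have no ordering
        # among them, so they could run in parallel at inference time.
        return {name: head(features) for name, head in self.branches.items()}

net = IncrementalLearningNet(backbone=lambda x: x * 0.5)
net.add_branch("cat", lambda f: f + 1.0)
net.add_branch("dog", lambda f: f - 1.0)
out = net.forward(2.0)
```

Contrast this with the tree-growth approach: here the cost of recognizing any category is one backbone pass plus one (parallelizable) head, regardless of how many categories have been added.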
In one embodiment, the backbone network includes at least one sequentially connected backbone module, the backbone module including a convolutional layer and at least one of: a transform reconstruction layer, an activation function layer, and a pooling layer.
Fig. 5 is a schematic structural diagram of a backbone network according to an embodiment of the present disclosure.
As shown in fig. 5, the backbone network may include a plurality of backbone modules, whose structures may be identical or different. For example, a backbone module may include a convolutional layer, a transform-reconstruction layer, an activation layer, and a pooling layer connected in sequence.
The transform-reconstruction layer is a batch normalization (BN) layer, a layer of the network in its own right, like the convolutional, activation, and fully connected layers; it is typically placed before the activation function layer. BN preprocesses each layer's input (e.g. normalizes it) and then applies a learnable transform-reconstruction to the normalized values, so that the normalization does not destroy the features the layer has learned. By introducing learnable parameters, the network can recover the feature distribution the original layer would otherwise have learned.
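The "recovery" property of the transform-reconstruction can be sketched in plain Python (a scalar-batch toy, not the patent's implementation): normalize a batch, then apply the learnable scale gamma and shift beta; choosing gamma equal to the batch standard deviation and beta equal to the batch mean reconstructs the original values.

```python
def transform_reconstruct(xs, gamma, beta, eps=1e-5):
    """Normalize a batch of scalars, then apply the learnable
    transform-reconstruction y = gamma * x_hat + beta."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    x_hat = [(x - mean) / (var + eps) ** 0.5 for x in xs]
    return [gamma * x + beta for x in x_hat]

xs = [1.0, 2.0, 3.0, 4.0]
mean = sum(xs) / len(xs)
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
# With gamma = std and beta = mean, the original distribution is
# recovered, which is the recovery property described above.
restored = transform_reconstruct(xs, gamma=std, beta=mean)
```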
A fully convolutional network (FCN) converts the fully connected layers of a traditional convolutional neural network (CNN) into convolutional layers one by one and outputs a label map.
In this embodiment, the backbone network is mainly used to extract general features from input data, so as to output the general features to the branch network, so that the branch network performs classification prediction and the like based on the general features.
It should be noted that the backbone network may directly reuse one or more of the shallower layers of an already trained model.
For example, the backbone network may use a mainstream convolutional neural network such as VGG, ResNet, Xception (built on depthwise separable convolutions), or MobileNet, with the low- and mid-level portion of the model cut out as the backbone (e.g., the part before conv4-3 of VGG, or the part before the middle flow of Xception). The backbone network is responsible for extracting general image features. In engineering problems (such as autonomous driving scenarios), a model pre-trained on ImageNet can be used directly, the corresponding layers locked, and only the branch network part trained.
In one embodiment, the branch network includes sequentially connected shallow convolutional layers, a global average pooling layer, and full convolutional layers. In this embodiment, the branch network is the smallest unit of network growth and corresponds to one particular category in the data set.
Fig. 6 is a schematic structural diagram of a branched network according to an embodiment of the present disclosure.
As shown in fig. 6, each branch network may include a plurality of branch modules with different structures. For example, a first branch module may include, in sequence, a convolutional layer, a transform-reconstruction layer, and an activation layer; a second branch module may include, in sequence, a convolutional layer and a transform-reconstruction layer; and a third branch module may include a full convolutional layer. Note that the structure of the branch network should correspond to that of the backbone network, so that the two can be combined into a complete classification network.
In one embodiment, the output of the branching network is one of two categories.
To give a small network good classification ability, the branch network is implemented as a binary classifier: the current category provides the positive samples, and all previously learned samples serve as negative samples. The branch network is updated iteratively as the data categories expand; as the number of branches grows, each branch sees more kinds of negative samples, which improves its ability to identify positive samples.
Fig. 7 is a schematic structural diagram of a branched network according to another embodiment of the present disclosure.
When VGG is selected as the backbone network, the matching branch network may comprise both shallow convolutional layers and full convolutional layers, as shown in fig. 7: a plurality of shallow convolutional layers, a global average pooling layer, and a plurality of full convolutional layers.
The shallow convolutional layers may share the same structure, for example 3 x 3 kernels (filter 3 x 3), 512 channels (channel 512), stride 1, and same padding, each combined with a BN layer, an activation layer, and a pooling layer as one unit. The full convolutional layers may differ from one another. For example, the first full convolutional layer uses filter 1 x 1, channel 256, stride 1, padding same, combined with a BN layer and an activation layer as one unit; the second uses filter 1 x 1, channel 64, stride 1, padding same, combined with a BN layer and an activation layer as one unit; and the third uses filter 1 x 1, channel 2, stride 1, padding same, combined with a BN layer as one unit.
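A hedged PyTorch sketch of a branch with this shape follows (channel sizes track the figure, but the exact layer count is an assumption, the pooling layers inside the shallow units are omitted for brevity, and `padding="same"` requires a recent PyTorch):

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int, k: int = 3) -> nn.Sequential:
    """Convolution combined with a BN layer and an activation layer."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=1, padding="same"),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class Branch(nn.Module):
    """Shallow 3 x 3 convs -> global average pooling -> 1 x 1 'full
    convolutional' layers ending in a 2-channel binary output."""

    def __init__(self, cin: int = 512):
        super().__init__()
        self.shallow = nn.Sequential(conv_bn_relu(cin, 512),
                                     conv_bn_relu(512, 512))
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.head = nn.Sequential(
            conv_bn_relu(512, 256, k=1),
            conv_bn_relu(256, 64, k=1),
            nn.Conv2d(64, 2, 1),   # channel 2: positive vs negative
            nn.BatchNorm2d(2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.shallow(x)
        x = self.gap(x)
        return self.head(x).flatten(1)  # (batch, 2)

branch = Branch().eval()  # eval mode: BN uses running statistics
with torch.no_grad():
    out = branch(torch.randn(2, 512, 7, 7))
```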
The incremental learning network of the disclosed embodiments comprises a plurality of branch networks connected in parallel, and new category data can be learned by adding a new branch network; the branches corresponding to existing categories therefore remain intact while new categories are learned, preserving the recognition ability for the existing categories. Unlike depth-wise growth, the incremental learning network grows by a width-expansion strategy, so the branches can be computed in parallel during network inference, reducing the performance loss that depth-wise growth would incur.
Another aspect of the present disclosure provides a training method for the incremental learning network as shown above.
Fig. 8 is a flow chart of a training method for an incremental learning network according to an embodiment of the present disclosure.
As shown in fig. 8, the training method may include operations S802 to S804.
In operation S802, for a branch network of a specified class, training data of the specified class is taken as a positive sample, and training data outside the specified class is taken as a negative sample.
In this embodiment, positive and negative samples can be derived quickly from the existing training data in this way, and each branch network can be trained with data of every category, which helps improve the accuracy of the model's output.
In operation S804, the incremental learning network is trained using the positive and/or negative samples.
In this embodiment, the incremental learning network may be trained with positive or negative examples based on, for example, a back propagation algorithm to determine model parameters for each branch network.
In one embodiment, training the incremental learning network with positive and/or negative examples may include the following operations.
On one hand, if the category of the training data differs from the existing categories of the historical training data, a branch network for the new category is added, and the added branch network is trained with at least part of the training data as positive samples and at least part of the historical training data as negative samples. New category data can thus be learned by adding a new branch network, without retraining the existing branch networks.
On the other hand, if the category of the training data belongs to an existing category of the historical training data, the branch network corresponding to that category is trained with at least part of the training data, together with at least part of the historical training data of the same category, as positive samples, and at least part of the historical training data of different categories as negative samples. In this way every branch network conveniently experiences the new training data, improving the accuracy of the model's predictions.
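The two cases above can be sketched as a small helper (hypothetical names; `history` maps each previously learned category to its stored samples):

```python
def build_branch_samples(category, new_data, history):
    """Assemble (positives, negatives) for the branch of `category`."""
    if category not in history:
        # New category: a branch is added; its positives are the new
        # data, its negatives come from all historical categories.
        positives = list(new_data)
        negatives = [s for xs in history.values() for s in xs]
    else:
        # Existing category: merge new data with stored same-category
        # data as positives; other categories supply the negatives.
        positives = list(new_data) + list(history[category])
        negatives = [s for c, xs in history.items() if c != category
                     for s in xs]
    return positives, negatives

history = {"cat": ["c1", "c2"], "dog": ["d1"]}
pos_new, neg_new = build_branch_samples("car", ["a1", "a2"], history)
pos_old, neg_old = build_branch_samples("cat", ["c3"], history)
```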
In one embodiment, training the incremental learning network with positive and/or negative examples may include the following operations.
On the one hand, if the category of the training data differs from the existing categories of the historical training data, at least part of the training data is used as negative samples to fine-tune the existing branch networks in the incremental learning network. The existing branches are thus fine-tuned with the new category's data, reducing the risk that they misclassify the new data; at the same time, because the negative samples seen by the existing branches keep growing, performance loss on their own category data is avoided and the prediction accuracy of the network improves.
On the other hand, if the category of the training data belongs to an existing category of the historical training data, positive and/or negative samples are extracted from the training data and the historical training data, and the existing branch networks in the incremental learning network are fine-tuned.
Fig. 9 is a schematic diagram of model training of a branch network according to an embodiment of the present disclosure.
As shown in fig. 9, for training data of a new category, a new branch network may be set up for that category. During training, at least part of the new category's training data may be used as positive samples of the newly added branch network and as negative samples of the existing branch networks. Historical training data of the existing categories can, of course, be used as negative samples of the newly added branch network.
Fig. 10 is a schematic diagram of model training of a branch network according to another embodiment of the present disclosure.
As shown in fig. 10, the category of the new training data is "car" and a car branch already exists among the branch networks, so no branch needs to be added. To let every branch network experience the new training data, the new data may be used as positive samples of the car branch, and at least part of the historical non-car training data as its negative samples. The positive and negative samples of the other branch networks are assigned analogously.
In one embodiment, the training method may further include the following operations.
On one hand, if the backbone network has not been trained with training data of the specified category, the network parameters of the backbone network are unlocked; otherwise they are locked.
On the other hand, if the number of branch networks attached to the backbone network is less than a preset number threshold, the network parameters of the backbone network are unlocked; otherwise they are locked.
For example, since the backbone network extracts general features shared by all categories, its network parameters are applicable to a wide range of scenarios, and the training cost can be reduced by locking them.
When few categories have been learned, the backbone network participates in the back-propagation of errors during training; as the number of learned categories increases, the backbone network can be locked and excluded from training.
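A minimal sketch of the count-based locking rule, assuming PyTorch and a hypothetical threshold of 10 branches (the threshold value is not fixed by the disclosure):

```python
import torch.nn as nn

def set_backbone_lock(backbone: nn.Module, num_branches: int,
                      threshold: int = 10) -> bool:
    """Lock the backbone once enough branches (learned categories)
    exist; keep it trainable while the category count is small."""
    locked = num_branches >= threshold
    for p in backbone.parameters():
        p.requires_grad = not locked
    return locked

backbone = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())
early = set_backbone_lock(backbone, num_branches=3)   # still training
late = set_backbone_lock(backbone, num_branches=12)   # now locked
```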
In one embodiment, the training method may further include the following operations.
First, representative training data is determined based on at least one of: the training data, historical training data of the same category as the training data, and historical training data of a different category from the training data.
A sample library is then constructed based on the representative training data.
Accordingly, training the incremental learning network with positive and/or negative examples may include: and training the incremental learning network by using the positive samples and/or the negative samples in the sample library.
For example, an approach similar to BiC may be taken: for each category of data, a portion is retained as representative samples (exemplars) to form a sample library. Unlike BiC, this data can be recalled when a branch network is trained and dynamically combined with the new category's data into a training set. The total amount of data in the sample library is capped at N, a hyper-parameter whose value depends on the memory size and computing power of the hardware of the computing platform used to train the incremental learning model.
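A toy sketch of such a capped sample library follows (the even per-class shrinking policy is an assumption; the disclosure only fixes the total cap N, and real systems may pick representatives more carefully, e.g. by herding):

```python
def update_sample_library(library, category, new_samples, capacity):
    """Add new samples for `category`, then enforce the total cap N
    by shrinking every category evenly when it is exceeded."""
    library.setdefault(category, []).extend(new_samples)
    total = sum(len(v) for v in library.values())
    if total > capacity:
        per_class = max(1, capacity // len(library))
        for c in library:
            library[c] = library[c][:per_class]
    return library

lib = {}
update_sample_library(lib, "cat", [f"cat{i}" for i in range(6)], capacity=8)
update_sample_library(lib, "dog", [f"dog{i}" for i in range(6)], capacity=8)
```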
For example, the categories corresponding to the non-current branches may be dynamically sampled as negative samples, which ensures that the existing network branches also see the current new data.
In the following, taking an actual autonomous driving scenario with VGG as the backbone network as an example, a training method of the incremental learning network is described.
First, the ImageNet pre-training model for VGG is loaded and all layers before conv4-3 are locked as the backbone network.
Then, a category is selected from the data set and input into the network.
If the input data belongs to a new category, one branch is added behind the backbone network (in parallel with the other branches, if any). The branch network is trained with the current category's data as positive samples and the representative samples of all other categories the network has learned as negative samples. To balance the effect of the difference between positive and negative sample counts, the loss function may use a binary focal loss, as shown in equation (1).
L = -α_t (1 - P_t)^γ log(P_t)    Formula (1)

where y is the true class (ground truth), P is the predicted probability of the positive class, P_t = P when y = 1 and P_t = 1 - P otherwise, and α_t = α when y = 1 and α_t = 1 - α otherwise. In the early stage of training α = 0.5 and γ = 1; as the samples in the sample library increase, α = 0.25 and γ = 2 in the later stage.
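A plain-Python sketch of the binary focal loss of equation (1) for a single prediction (no framework dependencies; batching and numerical-stability clamping are omitted):

```python
import math

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss, L = -alpha_t * (1 - p_t)**gamma * log(p_t),
    with p_t = p for positives (y = 1) and 1 - p for negatives."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, confident positive contributes far less loss than a hard
# one, which is the re-weighting that balances scarce positives.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.6, 1)
```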
If the input data belongs to an existing category, the current category's data is merged with the previously learned representative samples of that category as positive samples, and the data of all other categories the network has learned is used as negative samples to train the branch network. The loss function is again the binary focal loss of equation (1). After training, the representative sample library of the category is updated.
Next, fine-tuning is performed on all branches not belonging to the current category. During this training, representative samples of the current category's data are added to the overall sample library, and the categories corresponding to the non-current branches are dynamically sampled as negative samples; this step ensures that the old branches also see the current new data.
It should be noted that the activation function may use the rectified linear unit (ReLU). ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting. ReLU(x) = x if x > 0 and ReLU(x) = 0 if x ≤ 0, so it provides unilateral inhibition.
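A one-line illustration of the ReLU definition above:

```python
def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the
    rest (the unilateral inhibition described above)."""
    return x if x > 0 else 0.0

outputs = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]]
```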
Another aspect of the present disclosure provides an image processing method.
Fig. 11 is a flowchart of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 11, the image processing method may include operations S1102 and S1104.
In operation S1102, an input image is acquired.
For example, the input image is an image for an automatic driving task. Of course, the input image may also be an image for other tasks or fields, such as various scenes involving category recognition, etc.
In operation S1104, the input image is processed using an incremental learning network to determine an image recognition result.
The incremental learning network may be trained based on the above training method. The topology of the incremental learning network, network parameters, etc. may be as described above and will not be described in detail herein.
In one embodiment, processing the input image with the incremental learning network to determine the image recognition result may include the following operations.
First, the confidence of the processing result of each of the at least two branch networks for the input image is obtained.
Then, the confidences of the processing results are concatenated in order according to the ordering of the at least two branch networks.
Then, the category of the branch network corresponding to the position with the highest confidence is taken as the output of the incremental learning network.
Fig. 12 is a schematic diagram of a network derivation process according to an embodiment of the disclosure.
As shown in fig. 12, during network inference, the confidences (confidence scores) output by all branch networks are concatenated, so that all branches are packaged into one model file; after the input image passes through the backbone network, the branches are computed in parallel automatically, reducing the performance loss as the number of branches grows. The concatenation order corresponds to the category represented by each branch. With n kinds of branch networks, n concatenated confidences are obtained, where n is a positive integer greater than 0. Finally, the category corresponding to the position with the highest confidence is the output of the network (the category judgment). For example, if the confidence of the car-category branch (confidence 3) scores highest, the output of the incremental learning network is "car".
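The derivation step can be sketched as follows: concatenate the per-branch confidences in branch order and take the category at the position of the highest score (the scores are hypothetical toy values):

```python
def predict(branch_scores):
    """Concatenate per-branch confidences in branch order and return
    the category whose position holds the highest confidence."""
    names = list(branch_scores)
    confidences = [branch_scores[n] for n in names]  # the concatenation
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return names[best]

scores = {"cat": 0.12, "dog": 0.31, "car": 0.86}
label = predict(scores)
```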
Another aspect of the present disclosure provides an image processing apparatus.
Fig. 13 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 13, the image processing apparatus 1300 may include an image acquisition module 1310 and an image processing module 1320.
The image obtaining module 1310 is used for obtaining an input image.
The image processing module 1320 is used to process the input image using the incremental learning network to determine the image recognition result.
For example, the incremental learning network includes a backbone network and at least two branch networks, wherein each of the at least two branch networks corresponds to a different specified category, the backbone network and each branch network form a classification network for one specified category, the output of the backbone network serves as the input of each of the at least two branch networks, and the branch network is the minimum increment unit of the incremental learning network.
For example, the incremental learning network is trained based on the training method shown above.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any number of the image acquisition module 1310 and the image processing module 1320 may be combined in one module to be implemented, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the image capturing module 1310 and the image processing module 1320 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or in any one of three implementations of software, hardware, and firmware, or in any suitable combination of any of them. Alternatively, at least one of the image acquisition module 1310 and the image processing module 1320 may be at least partially implemented as a computer program module, which when executed, may perform corresponding functions.
Fig. 14 is a block diagram of an electronic device according to an embodiment of the disclosure. The electronic device shown in fig. 14 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 14, an electronic device 1400 according to an embodiment of the present disclosure includes a processor 1401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1402 or a program loaded from a storage portion 1408 into a Random Access Memory (RAM) 1403. Processor 1401 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1401 may also include onboard memory for caching purposes. Processor 1401 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the present disclosure.
In the RAM 1403, various programs and data necessary for the operation of the system 1400 are stored. The processor 1401, the ROM 1402, and the RAM 1403 are connected to each other by a bus 1404. The processor 1401 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1402 and/or the RAM 1403. Note that the programs may also be stored in one or more memories other than the ROM 1402 and the RAM 1403. The processor 1401 may also perform various operations of the method flows according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, system 1400 may also include an input/output (I/O) interface 1405, which input/output (I/O) interface 1405 is also connected to bus 1404. The system 1400 may also include one or more of the following components connected to the I/O interface 1405: an input portion 1406 including a keyboard, a mouse, and the like; an output portion 1407 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like; a storage portion 1408 including a hard disk and the like; and a communication portion 1409 including a network interface card such as a LAN card, a modem, or the like. The communication section 1409 performs communication processing via a network such as the internet. The driver 1410 is also connected to the I/O interface 1405 as necessary. A removable medium 1411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1410 as necessary, so that a computer program read out therefrom is installed into the storage section 1408 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. The computer program, when executed by the processor 1401, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include one or more memories other than ROM 1402 and/or RAM 1403 and/or ROM 1402 and RAM 1403 described above.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (15)

1. An image processing method comprising:
acquiring an input image; and
processing the input image with an incremental learning network to determine an image recognition result,
wherein the incremental learning network comprises a backbone network and at least two branch networks, wherein each of the at least two branch networks corresponds to a different specified category, the backbone network and each of the at least two branch networks form a classification network for one specified category, the output of the backbone network is used as the input of each of the at least two branch networks, and the branch networks are the minimum incremental units of the incremental learning network.
2. The method of claim 1, wherein the backbone network comprises at least one sequentially connected backbone module comprising convolutional layers and at least one of: a transform reconstruction layer, an activation function layer, and a pooling layer.
3. The method of claim 1, wherein the branching network comprises sequentially connected shallow convolutional layers, a global average pooling layer, and full convolutional layers.
4. The method of claim 1, wherein the output of the branching network is one of two classifications.
5. The method of claim 1, wherein the incremental learning network is trained by:
for a branch network of a specified class, taking training data of the specified class as a positive sample, and taking training data outside the specified class as a negative sample; and
training the incremental learning network with the positive samples and/or the negative samples.
6. The method of claim 5, wherein the training the incremental learning network with the positive samples and/or the negative samples comprises:
if the category of the training data differs from the existing categories of historical training data, adding a branch network for the category of the training data, and performing model training on the added branch network with at least part of the training data as positive samples and at least part of the historical training data as negative samples; and
if the category of the training data belongs to an existing category of historical training data, performing model training on the branch network corresponding to that category, with at least part of the training data together with at least part of the historical training data of the same category as positive samples, and at least part of the historical training data of categories different from the training data as negative samples.
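The two branches of claim 6 amount to a dispatch on whether the incoming category is new. The sketch below is illustrative only; the dict-based sample bookkeeping stands in for actual model training, and all names are assumptions.

```python
def update_with_batch(branches, category, data, history):
    """Dispatch one labeled batch per claim 6 (illustrative, not the patent's code).

    branches: dict mapping category -> {"pos": [...], "neg": [...]} training sets
    history:  dict mapping category -> previously seen samples
    """
    if category not in history:
        # New category: add a branch and train it with the new data as
        # positives and the historical data as negatives.
        positives = list(data)
        negatives = [s for _, samples in history.items() for s in samples]
    else:
        # Known category: new data plus same-category history are positives;
        # other-category history supplies the negatives.
        positives = list(data) + list(history[category])
        negatives = [s for cat, samples in history.items()
                     if cat != category for s in samples]
    branches[category] = {"pos": positives, "neg": negatives}
    history.setdefault(category, []).extend(data)
    return branches

history = {"cat": ["c1", "c2"]}
branches = {}
update_with_batch(branches, "dog", ["d1"], history)  # new category -> new branch
update_with_batch(branches, "cat", ["c3"], history)  # existing category -> retrain
```

Only the branch for the incoming category is (re)trained; the rest of the network is untouched, which is what makes the branch the minimum incremental unit.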
7. The method of claim 5, wherein the training the incremental learning network with the positive samples and/or the negative samples comprises:
if the category of the training data differs from the existing categories of historical training data, fine-tuning the existing branch networks in the incremental learning network with at least part of the training data as negative samples; and
if the category of the training data belongs to an existing category of historical training data, extracting positive samples and/or negative samples from the training data and the historical training data, and fine-tuning the existing branch networks in the incremental learning network.
8. The method of claim 5, further comprising:
if the backbone network has not been trained with training data of the specified category, unlocking the network parameters of the backbone network; otherwise, locking the network parameters of the backbone network;
and/or
if the number of branch networks of the backbone network is less than a preset number threshold, unlocking the network parameters of the backbone network; otherwise, locking the network parameters of the backbone network.
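The locking conditions of claim 8 can be stated as a small predicate. This is a sketch under the assumption that the two conditions may be used alone or combined with "or"; the claim leaves the exact combination open ("and/or"), and the function name and parameters are illustrative.

```python
def backbone_params_unlocked(trained_categories, batch_category,
                             num_branches, branch_threshold):
    """Decide whether the backbone parameters are trainable, per claim 8.

    trained_categories: set of categories the backbone has already been trained on
    batch_category:     category of the incoming training batch
    num_branches:       current number of branch networks on the backbone
    branch_threshold:   preset number threshold from the claim
    """
    untrained_on_category = batch_category not in trained_categories
    few_branches = num_branches < branch_threshold
    # Unlock if either condition holds; otherwise the backbone stays locked
    # so that adding a branch cannot disturb previously learned features.
    return untrained_on_category or few_branches

unlocked_new = backbone_params_unlocked({"cat"}, "dog", 5, 3)  # unseen category
locked_old = backbone_params_unlocked({"cat"}, "cat", 5, 3)    # seen, many branches
```

Freezing the backbone once it is mature is what keeps incremental steps cheap: only the newly added branch needs gradient updates.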
9. The method of claim 5, further comprising:
determining representative training data based on at least one of: the training data, historical training data of the same category as the training data, and historical training data of a category different from the training data;
constructing a sample library based on the representative training data; and
wherein the training the incremental learning network with the positive samples and/or the negative samples comprises: training the incremental learning network with the positive samples and/or the negative samples in the sample library.
10. The method of claim 9, wherein the total amount of data in the sample library is related to the hardware performance of the electronic device used to train the incremental learning network.
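Claims 9 and 10 together describe a capacity-bounded sample library. The sketch below is one possible reading, not the patent's method: "representativeness" is reduced to an evenly spaced subsample, and the capacity parameter stands in for whatever hardware-derived budget an implementation would use.

```python
def build_sample_library(new_data, same_cat_history, other_cat_history, capacity):
    """Keep a representative subset whose total size is capped by a capacity
    tied to the training hardware (illustrative sketch of claims 9 and 10)."""
    pool = list(new_data) + list(same_cat_history) + list(other_cat_history)
    if len(pool) <= capacity:
        return pool
    # Trivial "representativeness": an evenly spaced subsample of the pool.
    step = len(pool) / capacity
    return [pool[int(i * step)] for i in range(capacity)]

library = build_sample_library(range(10), range(10, 15), range(15, 20), 6)
```

A real system would pick representatives by a smarter criterion (e.g. diversity in feature space); the point here is only that the library size, not the full history, bounds memory during incremental training.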
11. The method of claim 1, wherein the processing the input image with an incremental learning network to determine an image recognition result comprises:
obtaining a confidence of the processing result of each of the at least two branch networks for the input image;
concatenating the confidences of the processing results in the order of the at least two branch networks; and
taking the category of the branch network corresponding to the position of the highest confidence as the output of the incremental learning network.
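The inference step of claim 11 reduces to concatenate-then-argmax over per-branch confidences. A minimal sketch, with illustrative names and a fixed branch order standing in for "the respective sequence of the at least two branch networks":

```python
def incremental_predict(branch_confidences, branch_order):
    """Concatenate per-branch confidences in a fixed branch order, then
    output the category at the argmax position (sketch of claim 11)."""
    vector = [branch_confidences[name] for name in branch_order]
    best = max(range(len(vector)), key=vector.__getitem__)
    return branch_order[best], vector

order = ["cat", "dog", "bird"]
conf = {"cat": 0.12, "dog": 0.91, "bird": 0.34}
label, vector = incremental_predict(conf, order)
```

Because each branch scores independently, appending a new branch only lengthens the concatenated vector; existing positions, and hence existing category outputs, are unaffected.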
12. The method of claim 1, wherein the input image is an image for an autonomous driving task.
13. An image processing apparatus comprising:
an image acquisition module configured to acquire an input image; and
an image processing module configured to process the input image with an incremental learning network to determine an image recognition result, wherein the incremental learning network comprises a backbone network and at least two branch networks, each of the at least two branch networks corresponding to a different specified category, the backbone network together with each of the at least two branch networks forming a classification network for one specified category, the output of the backbone network serving as the input of each of the at least two branch networks, and the branch network being the minimum incremental unit of the incremental learning network.
14. An electronic device, comprising:
one or more processors;
a storage device for storing executable instructions which, when executed by the one or more processors, implement the image processing method of any one of claims 1 to 12.
15. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, implement an image processing method according to any one of claims 1 to 12.
CN202011351182.9A 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment Active CN113762304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351182.9A CN113762304B (en) 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351182.9A CN113762304B (en) 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113762304A true CN113762304A (en) 2021-12-07
CN113762304B CN113762304B (en) 2024-02-06

Family

ID=78786085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351182.9A Active CN113762304B (en) 2020-11-26 2020-11-26 Image processing method, image processing device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113762304B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972850A (en) * 2022-05-11 2022-08-30 清华大学 Distribution inference method and device for multi-branch network, electronic equipment and storage medium
WO2024188293A1 (en) * 2023-03-14 2024-09-19 北京字跳网络技术有限公司 Model construction method and apparatus, object recognition method and apparatus, device, medium, and product

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358257A (en) * 2017-07-07 2017-11-17 华南理工大学 Under a kind of big data scene can incremental learning image classification training method
CN109165672A (en) * 2018-07-16 2019-01-08 华南理工大学 A kind of Ensemble classifier method based on incremental learning
CN109241880A (en) * 2018-08-22 2019-01-18 北京旷视科技有限公司 Image processing method, image processing apparatus, computer readable storage medium
CN110472545A (en) * 2019-08-06 2019-11-19 中北大学 The classification method of the power components image of taking photo by plane of knowledge based transfer learning
CN111340195A (en) * 2020-03-09 2020-06-26 创新奇智(上海)科技有限公司 Network model training method and device, image processing method and storage medium
CN111368874A (en) * 2020-01-23 2020-07-03 天津大学 Image category incremental learning method based on single classification technology
CN111488917A (en) * 2020-03-19 2020-08-04 天津大学 Garbage image fine-grained classification method based on incremental learning
CN111539480A (en) * 2020-04-27 2020-08-14 上海鹰瞳医疗科技有限公司 Multi-class medical image identification method and equipment
CN111597374A (en) * 2020-07-24 2020-08-28 腾讯科技(深圳)有限公司 Image classification method and device and electronic equipment
US20200344658A1 (en) * 2019-04-24 2020-10-29 Future Dial, Inc. Enhanced data analytics for actionable improvements based on data collected in wireless and streaming data networks


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HE LI: "Dual-branch iterative deep incremental image classification method", Pattern Recognition and Artificial Intelligence, pages 150 - 159 *
LI XI; ZHA YUFEI; ZHANG TIANZHU; CUI ZHEN; ZUO WANGMENG; HOU ZHIQIANG; LU HUCHUAN; WANG HANZI: "Survey of deep-learning-based visual object tracking algorithms", Journal of Image and Graphics, no. 12, pages 5 - 28 *
JIN GUANGZHI: "Multi-cue joint optimization tracking based on gray-level co-occurrence", Journal of University of Electronic Science and Technology of China, pages 252 - 257 *


Also Published As

Publication number Publication date
CN113762304B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
US11373305B2 (en) Image processing method and device, computer apparatus, and storage medium
CN110175671B (en) Neural network construction method, image processing method and device
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
US20210042580A1 (en) Model training method and apparatus for image recognition, network device, and storage medium
US20210406592A1 (en) Method and apparatus for visual question answering, computer device and medium
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
WO2022253074A1 (en) Data processing method and related device
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN111523640A (en) Training method and device of neural network model
US12118771B2 (en) Method and system for processing image, device and medium
CN110807514A (en) Neural network pruning method based on LO regularization
CN113762304A (en) Image processing method, image processing device and electronic equipment
WO2023125628A1 (en) Neural network model optimization method and apparatus, and computing device
CN117217280A (en) Neural network model optimization method and device and computing equipment
Zheng et al. CLMIP: cross-layer manifold invariance based pruning method of deep convolutional neural network for real-time road type recognition
Yuan et al. Low-res MobileNet: An efficient lightweight network for low-resolution image classification in resource-constrained scenarios
US20210056353A1 (en) Joint representation learning from images and text
WO2024175079A1 (en) Model quantization method and related device
CN111091198B (en) Data processing method and device
CN114298289A (en) Data processing method, data processing equipment and storage medium
CN117671271A (en) Model training method, image segmentation method, device, equipment and medium
CN116109868A (en) Image classification model construction and small sample image classification method based on lightweight neural network
CN115346084A (en) Sample processing method, sample processing apparatus, electronic device, storage medium, and program product
CN113704528A (en) Clustering center determination method, device and equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant