CN114140858A - Image processing method, device, equipment and storage medium - Google Patents
Image processing method, device, equipment and storage medium
- Publication number
- CN114140858A (application number CN202111484903.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- neural network
- loss
- image sample
- target neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
The application provides an image processing method, an image processing apparatus, an image processing device, and a storage medium. The image processing method comprises the following steps: inputting a first image sample and a second image sample into a target neural network; training the target neural network based on the first image features, the second image features, and a loss function of the target neural network, the loss function comprising pixel loss, perceptual loss, adversarial loss, and total variation loss; and identifying a target input image based on the trained target neural network to obtain a target output image. According to the method and the device, the target neural network can learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image processing method, an image processing apparatus, an image processing device, and a storage medium.
Background
In recent years, as research on face recognition technology has deepened, commercial face recognition systems have become feasible. However, most mature face recognition systems are deployed only in places such as subways, residential communities, and office areas, where the captured face images usually require the subject's cooperation to complete recognition and comparison, so face recognition in these scenes is largely controlled. In most practical situations, however, the surveillance camera is far from the monitored crowd, and the crowd is in an unconstrained state. In such an uncontrolled environment, the images captured by the surveillance camera are typically small, low-quality images, i.e., low-resolution face images. A low-resolution face image contains little useful information and much noise, which severely weakens a computer's ability to analyze the face and causes the performance of traditional face recognition algorithms to drop sharply. To meet this challenge, face super-resolution technology has attracted wide attention in recent years.
Most current reconstruction-based super-resolution methods are better suited to synthesizing local textures and usually do not introduce domain-specific prior knowledge, so they perform worse on face images than on natural scenes. In addition, most face hallucination algorithms based on convolutional neural networks consider only the mean-square-error loss in their design. Although this loss yields good objective metrics, the generated high-resolution images are often blurry as the input resolution decreases, and how to introduce more effective high-frequency information through the design of the network structure and the loss function remains an open problem for face hallucination algorithms. Given the importance of network structure and loss function design to performance in deep learning, a face super-resolution algorithm based on a conditional generative adversarial network and a channel attention mechanism is needed.
Disclosure of Invention
An object of the embodiments of the present application is to provide an image processing method, an image processing apparatus, an image processing device, and a storage medium, which solve the problem of image blur caused by a mean-square-error loss function when generating a high-resolution image from a low-resolution image, thereby improving the accuracy of the output high-resolution image.
To this end, a first aspect of the present application discloses an image processing method, the method comprising:
acquiring a first image sample and a second image sample, wherein the image resolution of the first image sample is lower than the resolution of the second image sample;
inputting the first image sample and the second image sample into a target neural network;
extracting image features of the first image sample and image features of the second image sample based on the target neural network, and obtaining first image features and second image features;
training the target neural network based on the first image feature, the second image feature, and a loss function of the target neural network, wherein the loss function serves as a constraint condition for the target neural network when converting a low-resolution image into a high-resolution image, and the loss function comprises pixel loss, perceptual loss, adversarial loss, and total variation loss;
and identifying a target input image based on the target neural network obtained through training, and obtaining a target output image, wherein the resolution of the target output image is higher than that of the target input image.
Compared with the prior art, which constrains neural network training with pixel loss alone and therefore yields neural networks with lower output accuracy, the method of the first aspect enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, thereby improving the output accuracy of the target neural network.
In the first aspect of the present application, as an optional implementation manner, before the obtaining the first image sample and the second image sample, the method further includes:
extracting at least two face images from a face database;
and extracting the face area of each face image and obtaining the first image sample and the second image sample.
In the first aspect of the present application, as an optional implementation manner, extracting the face area of each face image and obtaining the first image sample and the second image sample includes:
and extracting a face area of each face image based on an MTCNN face detection algorithm and obtaining the first image sample and the second image sample.
In the first aspect of the present application, as an optional implementation manner, the target neural network includes a generator, and the generator is of an Encoder-Decoder structure, where the Encoder-Decoder structure includes an Encoder and a Decoder;
and extracting the image features of the first image sample and the image features of the second image sample based on the target neural network, and obtaining the first image features and the second image features, including:
extracting image features of the first image sample based on the encoder and the decoder, and obtaining first image features;
and extracting image features of the second image sample based on the encoder and the decoder, and obtaining second image features.
In the first aspect of the present application, as an optional implementation manner, the encoder includes a first convolutional layer and a second convolutional layer, and a skip connection is provided between the first convolutional layer and the second convolutional layer;
and extracting image features of the first image sample based on the encoder and the decoder, and obtaining first image features, comprising:
performing convolution processing on the first image sample based on the first convolution layer to obtain a first feature map;
performing transpose convolution processing on the first image sample based on the second convolution layer to obtain a second feature map;
and fusing the first feature map and the second feature map based on the decoder and the skip connection between the first convolution layer and the second convolution layer to obtain the first image feature.
In the first aspect of the present application, as an optional implementation manner, the performing convolution processing on the first image sample based on the first convolution layer to obtain a first feature map includes:
acquiring channel attention information of the first convolution layer;
acquiring spatial attention information of the first convolution layer;
and performing convolution processing on the first image sample based on the first convolution layer, the channel attention information of the first convolution layer and the spatial attention information of the first convolution layer to obtain the first feature map.
In the first aspect of the present application, as an optional implementation manner, the performing a transpose convolution process on the first image sample based on the second convolution layer to obtain a second feature map includes:
acquiring channel attention information of the second convolutional layer;
acquiring spatial attention information of the second convolutional layer;
and performing transpose convolution processing on the first image sample based on the second convolutional layer, the channel attention information of the second convolutional layer, and the spatial attention information of the second convolutional layer to obtain the second feature map.
A second aspect of the present application discloses an image processing apparatus, the apparatus comprising:
a first obtaining module, configured to obtain a first image sample and a second image sample, where an image resolution of the first image sample is lower than an image resolution of the second image sample;
an input module for inputting the first image sample and the second image sample into a target neural network;
the extraction module is used for extracting the image characteristics of the first image sample and the image characteristics of the second image sample based on the target neural network and obtaining the first image characteristics and the second image characteristics;
a training module, configured to train the target neural network based on the first image feature, the second image feature, and a loss function of the target neural network, where the loss function serves as a constraint condition for the target neural network when converting a low-resolution image into a high-resolution image, and the loss function comprises pixel loss, perceptual loss, adversarial loss, and total variation loss;
and the recognition module is used for recognizing the target input image based on the target neural network obtained through training and obtaining a target output image, wherein the resolution of the target output image is higher than that of the target input image.
By executing the image processing method, the apparatus of the embodiments of the present application enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
A third aspect of the present application discloses an image processing apparatus, the apparatus comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the image processing method disclosed in the first aspect of the present application.
By executing the image processing method, the device of the embodiments of the present application enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
A fourth aspect of the present application discloses a storage medium storing computer instructions which, when invoked, execute the image processing method according to the first aspect of the present application.
By executing the image processing method, the storage medium of the embodiments of the present application enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be regarded as limiting the scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an image processing method according to an embodiment of the present disclosure. As shown in fig. 1, the image processing method according to the embodiment of the present application includes the following steps:
101. acquiring a first image sample and a second image sample, wherein the image resolution of the first image sample is lower than that of the second image sample;
102. inputting the first image sample and the second image sample into a target neural network;
103. extracting image features of the first image sample and image features of the second image sample based on the target neural network, and obtaining the first image features and the second image features;
104. training the target neural network based on the first image features, the second image features, and a loss function of the target neural network, wherein the loss function serves as a constraint condition for the target neural network when converting a low-resolution image into a high-resolution image, and the loss function comprises pixel loss, perceptual loss, adversarial loss, and total variation loss;
105. identifying the target input image based on the trained target neural network and obtaining a target output image, wherein the resolution of the target output image is higher than that of the target input image.
The image processing method of the embodiment of the present application can convert a low-resolution image, captured in a scene where the surveillance camera is far from the monitored crowd and the crowd is in an unconstrained state, into a high-resolution image. Inputting the first image sample and the second image sample into the target neural network lets the network learn the mapping between low-resolution and high-resolution images; the target neural network that has learned this mapping can then convert a lower-resolution target input image into a higher-resolution target output image, for example, converting a target input image with a resolution of 50 PPI into a target output image with a resolution of 100 PPI.
On the other hand, the pixel loss, perceptual loss, adversarial loss, and total variation loss in the loss function of the target neural network of the embodiments of the present application make the target output image more accurate, for example, closer to the ground-truth image. These losses constrain the learning of the target neural network so that it obtains the mapping between low-resolution and high-resolution images under the condition that the pixel loss, perceptual loss, adversarial loss, and total variation loss are all minimized. Specifically, the pixel loss characterizes the difference between the high-resolution image predicted by the target neural network and the ground-truth high-resolution image; the perceptual loss drives the output image to be closer to the ground-truth high-resolution face image in style and perception; the adversarial loss drives the output image to be indistinguishable from a real high-resolution face image; and the total variation loss constrains the smoothing effect of the target neural network.
Specifically, in the embodiment of the present application, the constraint of the pixel loss makes the difference between the target output image and the ground-truth high-resolution image smaller; the constraint of the perceptual loss makes the target output image closer to the ground-truth high-resolution face image in style and perception; the constraint of the adversarial loss raises the probability that the target output image passes as a real high-resolution face image; and the constraint of the total variation loss makes the target output image smoother.
Compared with the prior art, which constrains neural network training with pixel loss alone and therefore obtains neural networks with lower output accuracy, the method of the embodiments of the present application enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
In the embodiment of the present application, optionally, since each loss influences the output of the target neural network differently, the pixel loss, perceptual loss, adversarial loss, and total variation loss each carry a weight. Through these weights the losses can be fused, with each loss playing a different constraining role according to its degree of influence on the output of the target neural network.
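For illustration, the following minimal PyTorch sketch shows one way the four weighted loss terms might be fused for the generator; the weight values, the L1 form of the pixel loss, and the binary-cross-entropy form of the adversarial loss are assumptions for the sketch, not values taken from the patent:

```python
import torch
import torch.nn.functional as F

def total_variation_loss(img: torch.Tensor) -> torch.Tensor:
    # Penalize differences between neighboring pixels to constrain smoothing (NCHW input).
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def generator_loss(sr, hr, disc_logits_fake, perceptual_fn,
                   w_pix=1.0, w_perc=0.1, w_adv=1e-3, w_tv=1e-5):
    # Pixel loss: difference between predicted and ground-truth high-resolution images.
    l_pix = F.l1_loss(sr, hr)
    # Perceptual loss: distance in a pre-trained feature space (see the VGG16 sketch below).
    l_perc = perceptual_fn(sr, hr)
    # Adversarial loss: push the discriminator to rate super-resolved outputs as real.
    l_adv = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    # Total variation loss: smoothness constraint on the output image.
    l_tv = total_variation_loss(sr)
    # Each weight sets how strongly its loss constrains the network's output.
    return w_pix * l_pix + w_perc * l_perc + w_adv * l_adv + w_tv * l_tv
```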
In the embodiment of the present application, optionally, the perceptual loss is computed with a VGG16 network pre-trained on the ImageNet dataset.
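A minimal sketch of such a perceptual loss, assuming torchvision's ImageNet-pretrained VGG16 and a cut at the relu3_3 activation (the patent does not specify which layer is compared):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class VGGPerceptualLoss(nn.Module):
    def __init__(self, cut: int = 16):  # features[:16] ends at relu3_3 (an assumption)
        super().__init__()
        extractor = vgg16(weights="IMAGENET1K_V1").features[:cut]
        for p in extractor.parameters():
            p.requires_grad = False  # the pre-trained VGG16 stays frozen
        self.extractor = extractor.eval()

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        # Compare the two images in VGG feature space rather than pixel space.
        return F.mse_loss(self.extractor(sr), self.extractor(hr))
```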
In the embodiment of the present application, for step 101, the first image sample and the second image sample are paired; for example, the first image sample is a face image with a resolution of 50 PPI, and the second image sample is a face image of the same face with a resolution of 100 PPI. Further, the target neural network of the embodiments of the present application may be trained on multiple pairs of training samples, for example, 1000 sets of training samples, each set comprising a first image sample and a second image sample.
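A sketch of how such paired samples could be produced, assuming the low-resolution sample is obtained by bicubically downsampling the high-resolution crop (the ×2 factor is illustrative):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

class PairedFaceDataset(Dataset):
    def __init__(self, hr_faces, scale: int = 2):
        self.hr_faces = hr_faces  # list of CxHxW tensors, one per face crop
        self.scale = scale

    def __len__(self):
        return len(self.hr_faces)

    def __getitem__(self, idx):
        hr = self.hr_faces[idx]  # second image sample (high resolution)
        lr = F.interpolate(hr.unsqueeze(0), scale_factor=1 / self.scale,
                           mode="bicubic", align_corners=False).squeeze(0)
        return lr, hr  # first and second image sample of the same face
```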
Further, in step 101, the MTCNN face detection algorithm can extract face regions from the originally captured images to build the training and test sets, which include the first image sample and the second image sample. The originally captured images are pictures from the CelebA and LFW face databases. That is, in this embodiment of the present application, as an optional implementation manner, before the first image sample and the second image sample are obtained, the method of this embodiment further includes the following steps:
extracting at least two face images from a face database;
and extracting the face area of each face image and obtaining a first image sample and a second image sample.
And extracting the face region of each face image and obtaining a first image sample and a second image sample, comprising the following substeps:
and extracting the face area of each face image based on the MTCNN (Multi-Task Cascaded Convolutional Networks) face detection algorithm and obtaining the first image sample and the second image sample.
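As an illustration, face crops can be obtained with an off-the-shelf MTCNN implementation; the sketch below assumes the facenet-pytorch package, and the file name and crop parameters are hypothetical:

```python
# pip install facenet-pytorch
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=160, margin=0)  # crop size and margin are illustrative
img = Image.open("celeba_sample.jpg")    # hypothetical image from CelebA or LFW
face = mtcnn(img)                        # aligned face crop as a CxHxW tensor (None if no face found)
```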
In the embodiment of the present application, as an optional implementation manner, the target neural network includes a generator, where the generator is an Encoder-Decoder structure, where the Encoder-Decoder structure includes an Encoder and a Decoder;
and extracting the image features of the first image sample and the image features of the second image sample based on the target neural network, and obtaining the first image features and the second image features, comprising the following substeps:
extracting image features of the first image sample based on the encoder and the decoder, and obtaining first image features;
and extracting image features of the second image sample based on the encoder and the decoder, and obtaining second image features.
In the embodiment of the present application, as an optional implementation manner, the encoder includes a first convolution layer and a second convolution layer, and a skip connection is provided between the first convolution layer and the second convolution layer;
and extracting image features of the first image sample based on an encoder and a decoder, and obtaining the first image features, comprising the following substeps:
performing convolution processing on the first image sample based on the first convolution layer to obtain a first feature map;
performing transpose convolution processing on the first image sample based on the second convolution layer to obtain a second feature map;
and fusing the first feature map and the second feature map based on the decoder and the skip connection between the first convolution layer and the second convolution layer to obtain the first image feature.
In the embodiment of the present application, the skip connection means that the second convolution layer performs its processing based on the processing result of the first convolution layer.
In this embodiment, the encoder further includes residual blocks, where each first convolution layer is connected to a residual block, and the residual block and the first convolution layer jointly perform convolution processing on the first image sample to obtain the first feature map.
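A minimal sketch of a generator with this structure; the channel width, depth, and ×2 upscaling factor are assumptions, since the patent fixes only the first-convolution / residual-block / transpose-convolution / skip-connection arrangement:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)  # identity shortcut around the convolutions

class Generator(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        # "First convolution layer", followed by a residual block.
        self.conv1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1),
                                   nn.ReLU(inplace=True))
        self.res = ResidualBlock(ch)
        # "Second convolution layer": a transpose convolution that upsamples x2.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True))
        # Fusion of the two feature maps delivered by the skip connection.
        self.fuse = nn.Conv2d(2 * ch, 3, 3, padding=1)

    def forward(self, x):
        f1 = self.res(self.conv1(x))                   # first feature map
        f2 = self.deconv(f1)                           # second feature map (upsampled)
        f1_up = F.interpolate(f1, size=f2.shape[-2:])  # bring f1 to f2's spatial size
        return self.fuse(torch.cat([f1_up, f2], 1))    # fused high-resolution output
```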
In this embodiment, as an optional implementation manner, performing convolution processing on the first image sample based on the first convolution layer to obtain the first feature map includes the following sub-steps:
acquiring channel attention information of the first convolution layer;
acquiring spatial attention information of the first convolution layer;
and performing convolution processing on the first image sample based on the first convolution layer, the channel attention information of the first convolution layer, and the spatial attention information of the first convolution layer to obtain the first feature map.
In the embodiment of the present application, the channel attention information determines which features the target neural network attends to: for an input feature map of size H × W × C (where H × W is the spatial size and C is the number of channels), the channel attention mechanism assigns a weight to each of the C channels. The spatial attention information assigns a weight to each pixel of the feature map, taking individual pixels as the unit.
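A CBAM-style sketch of a block combining both kinds of attention; the patent names channel and spatial attention but not their exact form, so the pooling and reduction choices below are assumptions:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, ch: int, reduction: int = 8):
        super().__init__()
        # Channel attention: one weight per channel, i.e. "what" to attend to.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
        # Spatial attention: one weight per pixel of the feature map.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):                     # x: N x C x H x W
        x = x * self.channel_gate(x)          # weight each of the C channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_gate(pooled)  # weight every pixel
```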
In this embodiment, as an optional implementation manner, performing a transpose convolution process on the first image sample based on the second convolution layer to obtain the second feature map includes the following steps:
acquiring channel attention information of the second convolutional layer;
acquiring spatial attention information of the second convolutional layer;
and performing transpose convolution processing on the first image sample based on the second convolutional layer, the channel attention information of the second convolutional layer, and the spatial attention information of the second convolutional layer to obtain the second feature map.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 2, the image processing apparatus according to the embodiment of the present application includes the following functional blocks:
a first obtaining module 201, configured to obtain a first image sample and a second image sample, where the image resolution of the first image sample is lower than that of the second image sample;
an input module 202, configured to input the first image sample and the second image sample into a target neural network;
the extraction module 203 is configured to extract image features of the first image sample and image features of the second image sample based on the target neural network, and obtain the first image features and the second image features;
a training module 204, configured to train the target neural network based on the first image feature, the second image feature, and a loss function of the target neural network, where the loss function serves as a constraint condition for the target neural network when converting a low-resolution image into a high-resolution image, and the loss function comprises pixel loss, perceptual loss, adversarial loss, and total variation loss;
and the identifying module 205 is configured to identify the target input image based on the trained target neural network, and obtain a target output image, where a resolution of the target output image is higher than a resolution of the target input image.
By executing the image processing method, the apparatus of the embodiments of the present application enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
Please refer to the detailed description of the first embodiment of the present application for other descriptions of the embodiments of the present application, which are not repeated herein.
EXAMPLE III
Referring to fig. 3, fig. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the image processing apparatus of the embodiment of the present application includes:
a memory 302 storing executable program code;
a processor 301 coupled to a memory 302;
the processor 301 calls the executable program code stored in the memory 302 to execute the image processing method disclosed in the first embodiment of the present application.
By executing the image processing method, the device of the embodiments of the present application enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
Example four
The embodiment of the application discloses a storage medium storing computer instructions which, when invoked, execute the image processing method according to the first embodiment of the present application.
By executing the image processing method, the storage medium of the embodiments of the present application enables the target neural network to learn a more accurate mapping between low- and high-resolution face images based on the pixel loss, perceptual loss, adversarial loss, and total variation loss, so that the difference between the high-resolution face image output by the target neural network and the ground-truth high-resolution face image is smaller; that is, the output accuracy of the target neural network is improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through communication interfaces, and may be electrical, mechanical, or in other forms.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that the functions, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the portion of it that substantially contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above embodiments are merely examples of the present application and are not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. An image processing method, characterized in that the method comprises:
acquiring a first image sample and a second image sample, wherein the image resolution of the first image sample is lower than the resolution of the second image sample;
inputting the first image sample and the second image sample into a target neural network;
extracting image features of the first image sample and image features of the second image sample based on the target neural network, and obtaining first image features and second image features;
training the target neural network based on the first image features, the second image features and a loss function of the target neural network, wherein the loss function serves as a constraint condition for the target neural network to convert a low-resolution image into a high-resolution image, and the loss function comprises pixel loss, perceptual loss, adversarial loss and total variation loss;
and identifying a target input image based on the target neural network obtained through training, and obtaining a target output image, wherein the resolution of the target output image is higher than that of the target input image.
2. The method of claim 1, wherein prior to acquiring the first image sample and the second image sample, the method further comprises:
extracting at least two face images from a face database;
and extracting the face area of each face image and obtaining the first image sample and the second image sample.
3. The method of claim 2, wherein extracting the face area of each of the face images and obtaining the first image sample and the second image sample comprises:
and extracting a face area of each face image based on an MTCNN face detection algorithm and obtaining the first image sample and the second image sample.
4. The method of claim 1, wherein the target neural network comprises a generator that is an Encoder-Decoder structure, wherein the Encoder-Decoder structure comprises an Encoder and a Decoder;
and extracting the image features of the first image sample and the image features of the second image sample based on the target neural network, and obtaining the first image features and the second image features, including:
extracting image features of the first image sample based on the encoder and the decoder, and obtaining first image features;
and extracting image features of the second image sample based on the encoder and the decoder, and obtaining second image features.
5. The method of claim 4, wherein the encoder comprises a first convolutional layer and a second convolutional layer with a skip connection therebetween;
and extracting image features of the first image sample based on the encoder and the decoder, and obtaining first image features, comprising:
performing convolution processing on the first image sample based on the first convolution layer to obtain a first feature map;
performing transpose convolution processing on the first image sample based on the second convolution layer to obtain a second feature map;
and fusing the first feature map and the second feature map based on the decoder and the skip connection between the first convolution layer and the second convolution layer to obtain the first image feature.
6. The method of claim 5, wherein said convolving the first image sample based on the first convolution layer to obtain a first feature map comprises:
acquiring channel attention information of the first convolution layer;
acquiring spatial attention information of the first convolution layer;
and performing convolution processing on the first image sample based on the first convolution layer, the channel attention information of the first convolution layer and the spatial attention information of the first convolution layer to obtain the first feature map.
7. The method of claim 5, wherein performing the transpose convolution processing on the first image sample based on the second convolution layer to obtain a second feature map comprises:
acquiring channel attention information of the second convolutional layer;
acquiring spatial attention information of the second convolutional layer;
and performing transpose convolution processing on the first image sample based on the second convolutional layer, the channel attention information of the second convolutional layer, and the spatial attention information of the second convolutional layer to obtain the second feature map.
8. An image processing apparatus, characterized in that the apparatus comprises:
a first obtaining module, configured to obtain a first image sample and a second image sample, where an image resolution of the first image sample is lower than an image resolution of the second image sample;
an input module for inputting the first image sample and the second image sample into a target neural network;
the extraction module is used for extracting the image characteristics of the first image sample and the image characteristics of the second image sample based on the target neural network and obtaining the first image characteristics and the second image characteristics;
a training module, configured to train the target neural network based on the first image feature, the second image feature, and a loss function of the target neural network, where the loss function serves as a constraint condition for the target neural network to convert a low-resolution image into a high-resolution image, and the loss function comprises pixel loss, perceptual loss, adversarial loss, and total variation loss;
and the recognition module is used for recognizing the target input image based on the target neural network obtained through training and obtaining a target output image, wherein the resolution of the target output image is higher than that of the target input image.
9. An image processing apparatus, characterized in that the apparatus comprises:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the image processing method according to any one of claims 1 to 7.
10. A storage medium storing computer instructions for performing the image processing method according to any one of claims 1 to 7 when the computer instructions are called.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202111484903.8A | 2021-12-07 | 2021-12-07 | Image processing method, device, equipment and storage medium |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202111484903.8A | 2021-12-07 | 2021-12-07 | Image processing method, device, equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN114140858A (en) | 2022-03-04 |
Family

ID=80384696

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202111484903.8A | CN114140858A (en), pending | 2021-12-07 | 2021-12-07 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN114140858A (en) |
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN115743453A (en) | 2022-12-23 | 2023-03-07 | 重庆赛迪奇智人工智能科技有限公司 | System and method for cleaning attached materials of cabin wall |
Legal Events

| Code | Title |
| --- | --- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |