CN115049556A - StyleGAN-based face image restoration method - Google Patents
StyleGAN-based face image restoration method
- Publication number
- CN115049556A (Application No. CN202210736142.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- face
- code vector
- hidden code
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T5/77 — Image enhancement or restoration: retouching; inpainting; scratch removal
- G06N3/084 — Neural-network learning methods: backpropagation, e.g. using gradient descent
- G06T7/194 — Image analysis: segmentation or edge detection involving foreground-background segmentation
- G06V10/26 — Image preprocessing: segmentation of patterns in the image field; detection of occlusion
- G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning with neural networks
- G06V40/168 — Human faces: feature extraction; face representation
- G06T2207/10004 — Image acquisition modality: still image; photographic image
- G06T2207/30201 — Subject of image: face
Abstract
The application discloses a StyleGAN-based face image restoration method comprising the following steps: segmenting a real face image into a face region and a background region to form a training set; performing data enhancement on the dataset and setting the original image as the label; training an encoder with the training set and the labels to obtain an encoder network; using the encoder network to extract the latent code vector of a real face image, the latent code vector of the face region of the image to be restored, and the latent code feature map of the background region of the image to be restored; and mixing the latent code vector of the real face image with the latent code vector of the face region of the image to be restored to obtain a mixed face latent code vector, then feeding the mixed face latent code vector together with the background-region latent code feature map into a StyleGAN generator network to obtain the restored face image. The method greatly improves face image restoration capability and preserves structural similarity throughout the restoration process.
Description
Technical Field
The application relates to the field of computer vision, and in particular to a StyleGAN-based face image restoration method.
Background
In recent years, the quality of images produced by Generative Adversarial Networks (GANs) has improved remarkably; in particular, current methods can randomly generate high-quality face images with a neural network. The state-of-the-art GAN, StyleGAN, achieves the best visual quality on high-resolution images. StyleGAN also has a latent space W in which attributes can be disentangled; sampling randomly in W yields randomly generated face images. Embedding a real image into W, that is, obtaining the latent code vector of the real image and feeding it into the StyleGAN generator network, produces a reconstruction of that image. Existing research has found that embedding the real image into the extended W+ space yields a finer reconstruction. Two methods are mainly used to embed a real image into W+: one obtains the optimal reconstruction by iteratively optimizing the latent code vector; the other uses an encoder to obtain the latent code vector in a single forward pass. Because the StyleGAN generator model contains rich face image information, image restoration can exploit the face prior held in the generator. Moreover, StyleGAN controls the generated content through the latent code vector: injecting it into different layers of the generator network controls the result at different scales.
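To make the first of the two embedding routes concrete, the following is a minimal sketch of optimization-based embedding into W+, under assumed interfaces: `generator` is a pretrained StyleGAN generator callable on an 18 × 512 latent, `generator.mean_latent` is its average latent, and `lpips_loss` is a pretrained perceptual metric; none of these names come from the patent.

```python
import torch

# Minimal sketch of optimization-based W+ embedding. `generator`,
# `generator.mean_latent`, and `lpips_loss` are assumed handles, not
# interfaces defined by this patent.
def embed_into_w_plus(target, generator, lpips_loss, steps=500, lr=0.01):
    # Start from the generator's mean latent, broadcast to all 18 layers.
    w = generator.mean_latent.clone().repeat(18, 1).requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(w)                   # reconstruct from current code
        loss = ((recon - target) ** 2).mean() + lpips_loss(recon, target)
        loss.backward()                        # gradients flow into w only
        opt.step()
    return w.detach()                          # the W+ code of the real image
```

The encoder route, which the method below adopts, replaces this iterative loop with a single forward pass.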
Current face image restoration techniques usually rely on preset algorithms: the reconstructed result differs substantially from the original image, structural similarity is not well preserved during restoration, and the result lacks the texture and luster of real skin, so the overall effect is unsatisfactory and restoration is inconvenient. Traditional restoration methods depend on the boundary information and texture features of the image to be restored; being generally grounded in closed-form mathematical principles, they are poor at generating new information and lack robustness and generality. In summary, face image restoration methods still have considerable room for improvement.
Disclosure of Invention
By providing a StyleGAN-based face image restoration method, the application addresses the prior-art problems that the reconstructed result differs greatly from the original image and that structural similarity may not be well preserved during restoration; the method greatly improves face image restoration capability and preserves structural similarity throughout the restoration process.
An embodiment of the application provides a StyleGAN-based face image restoration method comprising the following steps: segmenting a real face image into a face region and a background region, which together form a training set; performing data enhancement on the dataset by horizontal flipping, and setting the original image as the label; training an encoder with the training set and the labels to obtain an encoder network; using the encoder network to extract the latent code vector of a real face image, the latent code vector of the face region of the image to be restored, and the latent code feature map of the background region of the image to be restored; and mixing the latent code vector of the real face image with the latent code vector of the face region of the image to be restored to obtain a mixed face latent code vector, then feeding the mixed face latent code vector together with the background-region latent code feature map into a StyleGAN generator network to obtain the restored face image.
Further, training the encoder with the training set and the labels comprises the following steps: encoding the image, with the face region and the background region encoded separately: for the face region, an encoder combining ResNet50 with SE attention modules encodes the input face-region image to obtain the latent code vector of the face part; for the background region, a convolutional neural network extracts background features to obtain the latent code feature map of the background part; reconstructing the image by feeding the face-part latent code vector and the background-part feature map into a StyleGAN2 generator; and optimizing the encoder by computing, between the label image and the reconstructed image, the pixel-wise L2 distance, the perceptual similarity score, and the L2 distance between face identity features, yielding the trained encoder network.
Further, the encoder combining ResNet50 with SE attention modules is used to extract the latent code vector of the face-region image.
Further, the face latent code vector has dimensions 18 × 512, and the background latent code feature map has dimensions 512 × 64.
Further, the encoder is optimized with three loss functions: the first computes the L2 distance between the image label and the generated image from their pixel values; the second extracts deep feature information from the image label and the generated image with a VGG16 neural network and computes the L2 distance between the two deep feature representations; and the third extracts face feature information from the image label and the generated image with a face recognition neural network and computes the L2 distance between the two sets of face features.
Further, the latent code vector of the real face image and the latent code vector of the face region of the image to be restored are mixed in the ratio 8:10.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
1. due to the adoption of the encoder method, the reconstruction work of the damaged image can be completed by one-time forward transmission, and the speed is high; meanwhile, the repairing method utilizes rich human face prior knowledge in StyleGAN, so that the repairing details of the five sense organs are more accurate and real.
2. Because the accurate repair of the damaged face image is realized through the pre-trained model, the true skin texture and luster can be given to the image.
Drawings
Fig. 1 is a flowchart of the StyleGAN-based face image restoration method in an embodiment of the application;
Fig. 2 is a flowchart of encoder training in an embodiment of the application;
Fig. 3 is a schematic structural diagram of the face image restoration method in an embodiment of the application.
Detailed Description
The embodiment of the application discloses a StyleGAN-based face image restoration method, which addresses the prior-art problems that the reconstructed result differs greatly from the original image and that structural similarity is not well preserved during restoration.
In view of the above technical problems, the general idea of the technical solution provided by the application is as follows: segment a real face image into a face region and a background region to form a training set; perform data enhancement on the dataset by horizontal flipping and set the original image as the label; train an encoder with the training set and the labels to obtain an encoder network; use the encoder network to extract the latent code vector of a real face image, the latent code vector of the face region of the image to be restored, and the latent code feature map of the background region of the image to be restored; and mix the latent code vector of the real face image with the latent code vector of the face region of the image to be restored to obtain a mixed face latent code vector, then feed it together with the background-region latent code feature map into a StyleGAN generator network to obtain the restored face image.
In order to make the above-mentioned basic method of the embodiments of the present application more comprehensible, specific embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 shows the StyleGAN-based face image restoration method in an embodiment of the application, described in detail below through specific steps.
S1: segment the real face image into a face region and a background region, which together form the training set.
In a specific implementation, the real face image can be segmented into a face region and a background region through a semantic segmentation network.
In a specific implementation, the photos in the real face image dataset are self-portrait photos of real-world people, collected to form the dataset.
In a specific implementation, the missing background portion of the face-region image is filled with RGB (0,0,0), and the missing face portion of the background-region image is likewise filled with RGB (0,0,0), as sketched below.
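As a minimal sketch, the split-and-fill step might look like the following; `mask` is an assumed (H, W) binary array (1 for face pixels) from the semantic segmentation network.

```python
import numpy as np

# Sketch of the face/background split: missing regions in each half
# are filled with RGB (0, 0, 0), as described above.
def split_face_background(image: np.ndarray, mask: np.ndarray):
    face = image * mask[..., None]               # background zeroed to (0,0,0)
    background = image * (1 - mask[..., None])   # face zeroed to (0,0,0)
    return face.astype(np.uint8), background.astype(np.uint8)
```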
In a specific implementation, StyleGAN is trained on a large number of real face images, yielding a StyleGAN generator model that stably generates diverse face images.
S2: perform data enhancement on the dataset by horizontal flipping, and set the original image as the label.
In a specific implementation, the unsegmented original image may be taken as the label image.
S3: train the encoder with the training set and the labels to obtain the encoder network.
In a specific implementation, referring to Fig. 2, training may proceed as follows.
S31: encode the image, with the face region and the background region encoded separately. For the face region, an encoder combining ResNet50 with SE attention modules encodes the input face-region image to obtain the latent code vector of the face part. For the background region, a convolutional neural network extracts background features to obtain the latent code feature map of the background part.
In a specific implementation, the encoder network that processes the face region can use a structure combining ResNet50 with SE attention modules, comprising 23 convolution blocks in total. Each convolution block consists of a BatchNorm layer, a two-dimensional convolution layer, a LeakyReLU activation function, and an SE attention module, and the block input, after max pooling, is added to the output of the SE module. This skip-connection structure improves information flow and effectively avoids the vanishing-gradient problem of overly deep networks.
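A minimal PyTorch sketch of one such block is shown below; the channel widths, the SE reduction ratio, and the 1 × 1 convolution on the skip path (needed to match channels for the addition) are assumptions, as the patent specifies only the block's components.

```python
import torch
import torch.nn as nn

# Sketch of one encoder convolution block: BatchNorm, 2D convolution,
# LeakyReLU, and an SE attention module, with the max-pooled input
# added to the SE output.
class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = x.mean(dim=(2, 3))             # squeeze: global average pool
        w = self.fc(w)[..., None, None]    # excitation: per-channel weights
        return x * w

class ConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            SEBlock(out_ch))
        self.skip = nn.Sequential(
            nn.MaxPool2d(kernel_size=stride, stride=stride),
            nn.Conv2d(in_ch, out_ch, 1))   # assumed 1x1 conv to match channels

    def forward(self, x):
        # Skip connection eases gradient flow in the deep encoder.
        return self.body(x) + self.skip(x)
```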
The feature map f1 output by the 6th convolution block, the feature map f2 output by the 20th convolution block, and the feature map f3 output by the 23rd convolution block may be extracted and fused by upsampling and addition into feature maps c1, c2, and c3, where c1 = f3, c2 = upsample(c1) + f2, and c3 = upsample(c2) + f1. Shallow features contain more detailed information, while deep features attend to the whole rather than to image detail; this feature-pyramid structure fuses deep and shallow features, so the encoder attends to detail while retaining the global features and semantic information of the image.
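The fusion itself reduces to a few lines; in the sketch below, equal channel widths across f1, f2, and f3 are an assumption.

```python
import torch.nn.functional as F

# Sketch of the pyramid fusion: f1, f2, f3 are the outputs of blocks
# 6, 20, and 23; deeper maps are upsampled and added to shallower ones.
def build_pyramid(f1, f2, f3):
    c1 = f3
    c2 = F.interpolate(c1, size=f2.shape[-2:], mode="bilinear",
                       align_corners=False) + f2     # c2 = upsample(c1) + f2
    c3 = F.interpolate(c2, size=f1.shape[-2:], mode="bilinear",
                       align_corners=False) + f1     # c3 = upsample(c2) + f1
    return c1, c2, c3   # coarse-to-fine fused feature maps
```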
In a specific implementation, the network module that transforms a feature map into a latent code vector consists of a two-dimensional convolution, a LeakyReLU activation function, and a fully connected layer. It processes c1, c2, and c3 separately: feature map c1 is transformed into a 3 × 512 latent code vector, c2 into a 4 × 512 latent code vector, and c3 into an 11 × 512 latent code vector. Concatenating these vectors yields the final 18 × 512 latent code vector.
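A sketch of one such head follows; the 512-wide styles and the 3/4/11 split come from the text, while the input channel count and spatial size are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a feature-map-to-latent head: a strided 2D convolution,
# LeakyReLU, and a fully connected layer producing n_styles vectors.
class Map2Style(nn.Module):
    def __init__(self, in_ch: int, spatial: int, n_styles: int):
        super().__init__()
        self.n_styles = n_styles
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 512, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True))
        self.fc = nn.Linear(512 * (spatial // 2) ** 2, n_styles * 512)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        h = self.conv(c).flatten(1)               # (B, 512 * (s/2)^2)
        return self.fc(h).view(-1, self.n_styles, 512)

# Concatenating the three heads yields the final latent:
# w = torch.cat([head1(c1), head2(c2), head3(c3)], dim=1)  # (B, 18, 512)
```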
In a specific implementation, the encoder network that processes the background-region image uses the same convolution block as the face encoder; because the background is processed into a latent code feature map, only 6 convolution blocks are used. Each block consists of a BatchNorm layer, a two-dimensional convolution layer, a ReLU activation function, and an SE attention module, with the max-pooled input added to the SE module's output. The background encoder network processes the background-region image into a 512 × 64 latent code feature map.
S32: reconstruct the image by feeding the latent code vector of the face part and the feature map of the background part into a StyleGAN2 generator to obtain the reconstructed image.
In a specific implementation, the output of the encoder network is connected to the input of the StyleGAN network: the output of the face-image encoder feeds the input of the StyleGAN generator, and the output of the background-image encoder is fused with an intermediate-layer feature map of the generator. The 18 × 512 latent code vector from the face-image encoder is injected into different layers of the StyleGAN generator to control face generation at different scales. The 512 × 64 output of the background-image encoder is fused, with weighting, into the generator's intermediate-layer feature map; by suppressing and enhancing certain regions of that feature map, the background is reconstructed accurately.
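The weighted fusion might be sketched as below; the blending weight `alpha` is an assumption, since the text says only that the fusion is weighted, and the two maps are taken to be broadcast-compatible in shape.

```python
import torch

# Sketch of the weighted fusion of the background feature map with an
# intermediate StyleGAN feature map. Suppressing/enhancing regions of
# the generator's features steers it toward an accurate background.
def fuse_background(gen_feat: torch.Tensor, bg_feat: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    return alpha * gen_feat + (1.0 - alpha) * bg_feat
```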
In a specific implementation, when the encoder is trained, the weights of the StyleGAN generator network are fixed, and the encoder is optimized by computing losses between the image generated by the StyleGAN generator and the preset label image.
The encoder is optimized by measuring the similarity between the generated image and the label image and computing losses from it. The overall loss is composed of three loss functions. The first is the mean squared error L_mse between the image label and the generated image, computed from pixel values. The second, L_lpips, extracts deep feature information from the image label and the generated image with a VGG16 neural network and computes the mean squared error between the two deep feature representations. The third, L_id, extracts face feature information from the image label and the generated image with a face recognition neural network and computes the mean squared error between the two sets of face features:

L_mse = ‖I − G(E(I))‖₂
L_lpips = ‖LPIPS(I) − LPIPS(G(E(I)))‖₂
L_id = ‖ID(I) − ID(G(E(I)))‖₂

where I is the input image, E is the encoder network being trained, and G is the pre-trained StyleGAN generator network. LPIPS is a pre-trained VGG16 network used to extract deep image features and compute the perceptual similarity of two images. ID is a pre-trained face recognition network used to extract the identity features of the face in an image.

The total loss function is

L_total = λ_mse · L_mse + λ_lpips · L_lpips + λ_id · L_id

where L_mse is the mean squared error between the pixel values of the two images, with loss weight λ_mse = 1.0; L_lpips is the mean squared error between the deep features of the two images, with loss weight λ_lpips = 0.8; and L_id is the mean squared error between the facial features of the two images, with loss weight λ_id = 0.5.
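A sketch of the total loss with the stated weights follows; `lpips_net` and `id_net` stand in for the pretrained VGG16 feature extractor and the face recognition network, and their exact interfaces are assumptions.

```python
import torch

# Sketch of the total training loss with the weights stated above
# (lambda_mse = 1.0, lambda_lpips = 0.8, lambda_id = 0.5).
def total_loss(real: torch.Tensor, generated: torch.Tensor,
               lpips_net, id_net,
               l_mse: float = 1.0, l_lpips: float = 0.8,
               l_id: float = 0.5) -> torch.Tensor:
    mse = ((real - generated) ** 2).mean()                          # L_mse
    lpips = ((lpips_net(real) - lpips_net(generated)) ** 2).mean()  # L_lpips
    ident = ((id_net(real) - id_net(generated)) ** 2).mean()        # L_id
    return l_mse * mse + l_lpips * lpips + l_id * ident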
S33: optimize the encoder. The pixel-wise L2 distance, the perceptual similarity score, and the L2 distance between face identity features are computed from the label image and the reconstructed image, and the encoder network is optimized accordingly to obtain the trained encoder network.
In a specific implementation, the batch size can be set to 8, the number of iterations to 300,000, and the learning rate to 1e-4. For each batch, 8 samples are drawn from the real face images; a semantic segmentation algorithm produces their face images and background images, which are fed into the face encoder network and the background encoder network respectively to obtain the corresponding latent code vectors and latent code feature maps; these are fed into the StyleGAN generator to obtain the generated images, completing the forward pass; the loss is computed with the loss functions and weights set above, and the face encoder network and background encoder network are optimized by backpropagation.
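Putting the pieces together, one training loop might look like the sketch below; `sample_real_faces`, `segment`, `face_enc`, `bg_enc`, `G`, `lpips_net`, and `id_net` are hypothetical handles tying together the components described above.

```python
import torch

# Sketch of the training loop under the stated schedule (batch size 8,
# 300,000 iterations, learning rate 1e-4).
for p in G.parameters():
    p.requires_grad_(False)        # the StyleGAN generator stays frozen

opt = torch.optim.Adam(
    list(face_enc.parameters()) + list(bg_enc.parameters()), lr=1e-4)

for step in range(300_000):
    batch = sample_real_faces(batch_size=8)   # 8 real face images
    faces, backgrounds = segment(batch)       # semantic segmentation split
    w = face_enc(faces)                       # (8, 18, 512) latent codes
    bg = bg_enc(backgrounds)                  # background feature maps
    recon = G(w, bg)                          # gradients flow through frozen G
    loss = total_loss(batch, recon, lpips_net, id_net)
    opt.zero_grad()
    loss.backward()                           # backpropagate into both encoders
    opt.step()
```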
S4: use the encoder network to extract the latent code vector of the real face image, the latent code vector of the face region of the image to be restored, and the latent code feature map of the background region of the image to be restored.
In a specific implementation, as shown in Fig. 3, the face recognition library Dlib may be used to locate face key points in the image to be restored and crop it, yielding the face image to be restored; a semantic segmentation algorithm then splits this face image into a face-region image and a background-region image.
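A sketch of the Dlib-based cropping follows; the landmark model file is Dlib's standard 68-point predictor, and the crop margin is an assumption.

```python
import dlib
import numpy as np

# Sketch of the preprocessing step: Dlib locates 68 face landmarks and
# the image is cropped around them before segmentation.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_face(img: np.ndarray, margin: int = 40) -> np.ndarray:
    rect = detector(img, 1)[0]                    # first detected face
    pts = predictor(img, rect)
    xs = [pts.part(i).x for i in range(68)]
    ys = [pts.part(i).y for i in range(68)]
    x0, x1 = max(min(xs) - margin, 0), min(max(xs) + margin, img.shape[1])
    y0, y1 = max(min(ys) - margin, 0), min(max(ys) + margin, img.shape[0])
    return img[y0:y1, x0:x1]                      # face image to be restored
```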
S5: mix the latent code vector of the real face image with the latent code vector of the face region of the image to be restored to obtain the mixed face latent code vector, and feed the mixed face latent code vector together with the latent code feature map of the background region of the image to be restored into the StyleGAN generator network to obtain the restored face image.
In a specific implementation, the latent code vector of the image to be restored and the latent code vector of the real face image are mixed in the ratio 8:10 to obtain the mixed latent code vector: the first 8 × 512 dimensions of the latent code vector of the face to be restored and the last 10 × 512 dimensions of the latent code vector of the real face are concatenated into a new 18 × 512 latent code vector. Because StyleGAN controls generation through the latent code vector, different dimensions of the vector control the generated image at different scales. The 8:10 mixing ratio retains the coarse facial-feature style, appearance, and similar information of the face image to be restored while fully exploiting the face prior information contained in the StyleGAN generator network.
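In code, the 8:10 mix is a single concatenation; the sketch below assumes each latent is an (18, 512) tensor.

```python
import torch

# Sketch of the 8:10 latent mix: the first 8 style vectors come from
# the latent of the face to be restored (coarse style and appearance),
# the last 10 from the real face's latent.
def mix_latents(w_damaged: torch.Tensor, w_real: torch.Tensor) -> torch.Tensor:
    return torch.cat([w_damaged[:8], w_real[8:]], dim=0)   # new (18, 512) code
```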
The mixed latent code vector and the background latent code feature map of the image to be restored are fed into the StyleGAN generator network, which outputs the reconstructed image. Because the background of each picture is unique, storing face information and background information in a single latent code vector would be burdensome; processing the face image and background image separately, and storing the background information in its own latent code feature map, facilitates the reconstruction of diverse backgrounds.
In summary, the StyleGAN-based face image restoration method preserves the identity information of the face to be restored while restoring the facial features, skin, texture, and luster of the image. The method first trains a StyleGAN generator to obtain rich face prior knowledge; it then trains the encoder network, using pixel-level loss, holistic perceptual-similarity loss, and face-attribute-similarity loss, so that the encoder accurately expresses face information and background information through latent code vectors and feature maps. The image is reconstructed under the joint control of the latent code vector and the latent code feature map; the reconstructed image carries the facial-feature and appearance information of the face image to be restored and gains skin luster and texture.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (6)
1. A StyleGAN-based face image restoration method, characterized by comprising the following steps:
segmenting a real face image into a face region and a background region, which together form a training set;
performing data enhancement on the dataset by horizontal flipping, and setting the original image as the label;
training an encoder with the training set and the labels to obtain an encoder network;
using the encoder network to extract the latent code vector of a real face image, the latent code vector of the face region of the image to be restored, and the latent code feature map of the background region of the image to be restored;
and mixing the latent code vector of the real face image with the latent code vector of the face region of the image to be restored to obtain a mixed face latent code vector, and feeding the mixed face latent code vector together with the latent code feature map of the background region of the image to be restored into a StyleGAN generator network to obtain the restored face image.
2. The StyleGAN-based face image restoration method of claim 1, wherein training the encoder with the training set and the labels comprises the following steps:
encoding the image, with the face region and the background region encoded separately: for the face region, an encoder combining ResNet50 with SE attention modules encodes the input face-region image to obtain the latent code vector of the face part; for the background region, a convolutional neural network extracts background features to obtain the latent code feature map of the background part;
reconstructing the image by feeding the latent code vector of the face part and the feature map of the background part into a StyleGAN2 generator to obtain a reconstructed image;
and optimizing the encoder by computing, between the label image and the reconstructed image, the pixel-wise L2 distance, the perceptual similarity score, and the L2 distance between face identity features, thereby obtaining the trained encoder network.
3. The StyleGAN-based face image restoration method of claim 2, wherein the encoder combining ResNet50 with SE attention modules is used to extract the latent code vector of the face-region image.
4. The StyleGAN-based face image restoration method of claim 2, wherein the face latent code vector has dimensions 18 × 512 and the background latent code feature map has dimensions 512 × 64.
5. The StyleGAN-based face image restoration method of claim 2, wherein the encoder is optimized with three loss functions: the first computes the L2 distance between the image label and the generated image from their pixel values; the second extracts deep feature information from the image label and the generated image with a VGG16 neural network and computes the L2 distance between the two deep feature representations; and the third extracts face feature information from the image label and the generated image with a face recognition neural network and computes the L2 distance between the two sets of face features.
6. The StyleGAN-based face image restoration method of claim 1, wherein the latent code vector of the real face image and the latent code vector of the face region of the image to be restored are mixed in the ratio 8:10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210736142.9A | 2022-06-27 | 2022-06-27 | StyleGAN-based face image restoration method
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210736142.9A | 2022-06-27 | 2022-06-27 | StyleGAN-based face image restoration method
Publications (1)
Publication Number | Publication Date
---|---
CN115049556A | 2022-09-13
Family
ID=83164006
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210736142.9A | StyleGAN-based face image restoration method | 2022-06-27 | 2022-06-27
Country Status (1)
Country | Link
---|---
CN | CN115049556A (en), Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN115631527A | 2022-10-31 | 2023-01-20 | 福州大学至诚学院 | Angle-adaptive hairstyle attribute editing method and system
CN115861343A | 2022-12-12 | 2023-03-28 | 中山大学·深圳 | Method and system for representing arbitrary-scale images based on a dynamic implicit image function
CN115861343B | 2022-12-12 | 2024-06-04 | 中山大学·深圳 | Arbitrary-scale image representation method and system based on a dynamic implicit image function
CN116362972A | 2023-05-22 | 2023-06-30 | 飞狐信息技术(天津)有限公司 | Image processing method, device, electronic equipment and storage medium
CN116362972B | 2023-05-22 | 2023-08-08 | 飞狐信息技术(天津)有限公司 | Image processing method, device, electronic equipment and storage medium
CN116884077A | 2023-09-04 | 2023-10-13 | 上海任意门科技有限公司 | Face image category determining method and device, electronic equipment and storage medium
CN116884077B | 2023-09-04 | 2023-12-08 | 上海任意门科技有限公司 | Face image category determining method and device, electronic equipment and storage medium
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination