CN115908109A - Facial image stylized model training method, equipment and storage medium
- Publication number: CN115908109A (application CN202211367061.2A)
- Authority: CN (China)
- Prior art keywords: image, face, network model, hidden vector, stylized
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Image Analysis (AREA)
Abstract
The application provides a facial image stylized model training method, device, and storage medium, which are used in the fields of virtual reality and augmented reality and can also be used in the field of virtual human modeling. The method comprises the following steps: acquiring a face input image and a style sample image, and optimizing a first 2DGAN model according to a first hidden vector corresponding to the style sample image to obtain a second 2DGAN model; determining a second hidden vector corresponding to the face input image, and a third hidden vector corresponding to the face input image and the camera view angle of the face input image; generating, according to the second hidden vector and the second 2DGAN model, a first 2D stylized face image serving as the supervision information of the 3DGAN model to be trained; generating a 3D stylized face image according to the third hidden vector, the camera view angle, and the 3DGAN model; and determining a first loss function value from the first 2D stylized face image and the 3D stylized face image, so as to train the 3DGAN model based on the first loss function value.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a facial image stylized model training method, device, and storage medium.
Background
With the popularity of short-video applications, face stylization has become increasingly common in the short-video field, and face cartoonization in particular has wide application in fields such as social networking. Stylization refers to the process of applying an artistic style to a picture.
Current face stylization algorithms are all based on two-dimensional (2D) images. Taking face cartoonization as an example: given a cartoon face image A and a real face image B, the style of the cartoon face image A is migrated to the real face image B by means of 2D Generative Adversarial Network (GAN) style transfer, yielding a cartoonized face image C, which is a 2D image.
Face stylization methods that operate only in the 2D domain can produce only 2D stylized face images, which limits their applications.
Disclosure of Invention
The embodiments of the invention provide a facial image stylized model training method, apparatus, device, and storage medium, which are used to train a 3D generation network model capable of generating 3D stylized face images.
In a first aspect, an embodiment of the present invention provides a method for training a facial image stylized model, where the method includes:
acquiring a 2D face input image and a style sample image;
optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, wherein the first hidden vector reflects style characteristics in the style sample image;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image, wherein the second hidden vector reflects face features in the face input image, and the third hidden vector reflects face features and pose features in the face input image;
generating, according to the second hidden vector and the second 2D generation network model, a first 2D stylized face image serving as supervision information of a 3D generation network model to be trained;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the first 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models.
In a second aspect, an embodiment of the present invention provides a device for training a facial image stylized model, where the device includes:
the obtaining module is used for obtaining a 2D face input image and a style sample image;
the optimization module is used for optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, and the first hidden vector reflects style characteristics in the style sample image;
the training module is used for determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and the camera view angle of the face input image; generating a first 2D stylized face image serving as supervision information of a 3D generation network model to be trained according to the second hidden vector and the second 2D generation network model; generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model; determining a first loss function value for training the 3D generation network model according to the first 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; the first 2D generation network model and the 3D generation network model are both neural network models, the second hidden vector reflects the face features in the face input image, and the third hidden vector reflects the face features and pose features in the face input image.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the method of training a stylized model of a face image according to the first aspect.
In a fourth aspect, the present invention provides a non-transitory machine-readable storage medium, on which is stored executable code, when the executable code is executed by a processor of an electronic device, the processor is caused to execute the facial image stylized model training method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides a method for training a facial image stylized model, where the method includes:
receiving a request triggered by a terminal device through calling a setting service, wherein the request comprises a 2D face input image and a style sample image;
executing the following steps by utilizing the processing resource corresponding to the setting service:
optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, wherein the first hidden vector reflects style characteristics in the style sample image;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image, wherein the second hidden vector reflects face features in the face input image, and the third hidden vector reflects face features and pose features in the face input image;
generating a 2D stylized face image serving as supervision information of a 3D generation network model to be trained according to the second hidden vector and the second 2D generation network model;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models;
and feeding back the trained 3D generation network model to the terminal equipment.
In a sixth aspect, an embodiment of the present invention provides a face image stylized model training method, which is applied to an augmented reality device, and the method includes:
displaying a 2D face input image and a style sample image;
optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, wherein the first hidden vector reflects style characteristics in the style sample image;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image, wherein the second hidden vector reflects face features in the face input image, and the third hidden vector reflects face features and pose features in the face input image;
generating a 2D stylized face image serving as supervision information of a 3D generation network model to be trained according to the second hidden vector and the second 2D generation network model;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models.
The model training method provided by the embodiments of the invention is used to train a 3D generation network model capable of generating a three-dimensional (3D) stylized face image. Based on the trained model, given a real face image and an arbitrary camera view angle, a 3D stylized face image presenting the style features at that view angle can be generated automatically. Specifically, the 3D generation network model is trained with the help of a 2D generation network model: a pre-trained first 2D generation network model is tuned, guided by the first hidden vector corresponding to a given style sample image, into a second 2D generation network model adapted to the style feature domain represented by that image. The guidance provided by the second 2D generation network model is embodied as follows: a second hidden vector corresponding to the face input image is determined and input into the second 2D generation network model to generate a first 2D stylized face image serving as supervision information for the 3D generation network model to be trained. Then, after a third hidden vector corresponding to the face input image and the camera view angle of the face input image is determined, the third hidden vector and the camera view angle are input into the 3D generation network model to generate a 3D stylized face image. Using the first 2D stylized face image as supervision information, a first loss function value is calculated against the 3D stylized face image to train the 3D generation network model, so that the trained 3D generation network model has the capability of generating stylized face images with 3D geometry at different camera view angles.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and other drawings can be obtained by those of ordinary skill in the art based on these drawings without creative effort.
Fig. 1 is a flowchart of a facial image stylized model training method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a 2DGAN model optimization process according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a third hidden vector optimization process of a face input image according to an embodiment of the present invention;
fig. 4 is an application schematic diagram of a facial image stylized model training method according to an embodiment of the present invention;
fig. 5 is an application schematic diagram of a facial image stylized model training method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a facial image stylized model training device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device provided in this embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
In addition, the sequence of steps in the embodiments of the methods described below is merely an example, and is not strictly limited.
The model training method provided by the embodiments of the invention is used to train a 3D generation network model capable of generating three-dimensional (3D) stylized face images, which may be called a 3DGAN model. The 3DGAN model is a neural network model; for example, an Efficient Geometry-aware 3D Generative Adversarial Network (EG3D) model may be used. Here, 3D awareness means that the generated stylized face has a 3D shape conforming to the given style features.
A face input image and a style sample image are required during training of the 3DGAN model. The style sample image may be, for example, a cartoon image showing a cartoon character, and the face input image is the face image to be stylized. For example, in an online video conference scenario, a participant may want a 3D avatar of himself or herself in the style of the cartoon character; in this case, a 2D image of the participant serves as the face input image.
The 3DGAN model trained in the embodiments of the invention can generate the corresponding 3D stylized face image for the face input image, and can also generate the 3D stylized face image corresponding to any input camera view angle. Thus, in the online video conference scenario above, a 3D avatar of the participant's face that can rotate dynamically can be presented on the conference interface, giving the 3D stylized face image a more vivid dynamic effect.
In summary, the training of the 3DGAN model in the embodiments of the invention is mainly intended to learn, given a style sample image and a face input image, a 3D representation of the style effect of the style sample image, so that a corresponding 3D stylized face image can be generated from any view angle.
The process of training the 3DGAN model is explained below.
Fig. 1 is a flowchart of a facial image stylized model training method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
101. Acquire a 2D face input image and a style sample image.
102. Optimize the pre-trained first 2D generation network model according to the first hidden vector corresponding to the style sample image to obtain a second 2D generation network model.
103. Determine a second hidden vector corresponding to the face input image, and a third hidden vector corresponding to the face input image and the camera view angle of the face input image.
104. Generate, according to the second hidden vector and the second 2D generation network model, a first 2D stylized face image serving as supervision information for the 3D generation network model to be trained, and generate a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image, and the 3D generation network model.
105. Determine a first loss function value for training the 3D generation network model according to the first 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value.
In the embodiments of the invention, training the 3DGAN model requires a 2D generation network model, hereinafter referred to as the 2DGAN model. The 2DGAN model is a neural network model, such as a StyleGAN model or one of its improved versions.
In short, the 2DGAN model mainly plays a guiding role by providing supervision information during training of the 3DGAN model. That is, the embodiments of the invention propose a new dual-branch architecture with a 2D branch and a 3D branch: under the 2D branch, an intermediate image, i.e., the first 2D stylized face image, is generated by the 2DGAN model as a guide to help the 3DGAN model under the 3D branch generate stylized face images with 3D geometry.
Specifically, a 2DGAN model pre-trained on a public training set, called the first 2DGAN model, is obtained; the first 2DGAN model is then optimized based on the given style sample image so as to fine-tune it to the sample's style feature domain, yielding an optimized second 2DGAN model; the second 2DGAN model is then used to guide the training of the 3DGAN model.
First, a first hidden vector (latent code) corresponding to the style sample image may be obtained, and the first 2DGAN model may be optimized based on the first hidden vector. The first hidden vector reflects a style feature in the style sample image.
Specifically, the style sample image may be input into a set generative network (GAN) inversion model to obtain the first hidden vector corresponding to the style sample image. The GAN inversion model is also a neural network model; for example, an encoder4editing (e4e) model may be used to map the style sample image into a latent space.
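For illustration, inversion with such an encoder might look like the minimal PyTorch-style sketch below; the `E4EEncoder` class, checkpoint path, and tensor shapes are illustrative assumptions, not details from the patent.

```python
import torch
from my_models import E4EEncoder  # hypothetical module; stands in for an e4e implementation

# Load an assumed pre-trained e4e encoder and invert the style sample image.
encoder = E4EEncoder()
encoder.load_state_dict(torch.load("e4e_ffhq.pt"))  # assumed checkpoint file
encoder.eval()

style_image = torch.randn(1, 3, 256, 256)  # stand-in for the style sample image y_e
with torch.no_grad():
    w_e = encoder(style_image)  # first hidden vector, e.g. (1, 18, 512) in W+ space
```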
Then, the pre-trained first 2DGAN model is optimized according to the first hidden vector corresponding to the style sample image to obtain the second 2DGAN model. Through this optimization, the initial first 2DGAN model is tuned to the style feature domain corresponding to the style sample image, i.e., the optimized second 2DGAN model fully learns the style features presented in the style sample image.
Briefly, the optimization process works as follows: the first hidden vector corresponding to the style sample image is mixed with hidden vectors that the first 2DGAN model generates for image generation, so that the mixed hidden vectors used to generate images contain the style features presented in the style sample image.
After the optimization of the first 2DGAN model is completed, the second 2DGAN model obtained by the optimization can be used to assist in completing the training of the 3DGAN model.
Specifically, a second hidden vector reflecting the face features and a third hidden vector reflecting the face features and pose features are extracted from the face input image. The second hidden vector is input into the second 2DGAN model to obtain a generated image: the first 2D stylized face image. The third hidden vector and the camera view angle of the face input image are input into the 3DGAN model to obtain another generated image: the 3D stylized face image.
Taking the first 2D stylized face image as supervision information, a first loss function value for training the 3DGAN model is calculated based on the difference between the 3D stylized face image and the first 2D stylized face image, and the parameters of the 3DGAN model are adjusted based on this value.
Compared with the traditional scheme in which style migration is performed in the 2D domain, the embodiments of the invention migrate the style features of the style sample image onto the input face image in the 3D domain, obtaining a 3D stylized face image. Moreover, in order to simplify the training of a 3DGAN model that achieves this while ensuring model performance, the scheme requires only one style sample image and one face input image, whereas traditional 3DGAN training requires a large number of training sample images: multiple 2D stylized face images for supervising the 3DGAN training can be generated automatically by the 2DGAN model once it has been optimized to the feature domain of the style sample image. Few sample images therefore need to be collected manually. Furthermore, the trained 3D generation network model can generate stylized face images with 3D geometry at different camera view angles.
In an optional embodiment, the optimizing the pre-trained first 2DGAN model according to the first hidden vector corresponding to the style sample image to obtain the second 2DGAN model may be implemented as: and acquiring a fourth hidden vector for generating the image through the first 2DGAN model, and performing optimization training on the first 2DGAN model according to a mixed hidden vector of the first hidden vector and the fourth hidden vector corresponding to the style sample image to obtain a second 2DGAN model.
As described above, the 2DGAN model may adopt a StyleGAN model, which comprises two parts: a mapping network and a synthesis network. Its working process, briefly, is as follows: a given random hidden vector is input into the mapping network; after processing by the several fully connected layers of the mapping network, an intermediate hidden vector is output; after affine transformations, this hidden vector is fed into different network layers of the synthesis network, so that the synthesis network finally outputs the generated image.
In this case, the fourth hidden vector actually corresponds to the hidden vector output by the mapping network, so that the synthesis network generates an image based on the hidden vector.
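For illustration, the two-stage forward pass might look like the following minimal sketch; the `mapping` and `synthesis` modules are assumed to be the sub-networks of a pre-trained StyleGAN-family generator, and their signatures are simplifications, not the patent's implementation.

```python
import torch

# `mapping` and `synthesis` are assumed to be the two sub-networks of a
# pre-trained StyleGAN-family generator, loaded elsewhere (hypothetical names).
z = torch.randn(1, 512)   # random input hidden vector
s = mapping(z)            # fourth hidden vector: the intermediate latent
image = synthesis(s)      # generated 2D image, e.g. shape (1, 3, 256, 256)
```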
After the fourth hidden vector is obtained, it is mixed with the first hidden vector corresponding to the style sample image according to a set mixing scheme to obtain a mixed hidden vector, and the first 2DGAN model is optimized based on the mixed hidden vector to obtain the second 2DGAN model.
Specifically, the hybrid hidden vector may be input into the first 2DGAN model to obtain a second 2D stylized face image output by the first 2DGAN model, and then a second loss function value for optimizing the first 2DGAN model is determined according to the second 2D stylized face image and the style example image, so as to perform optimization training on the first 2DGAN model according to the second loss function value.
For ease of understanding, the optimization process of the first 2DGAN model is exemplarily described in connection with fig. 2.
As shown in Fig. 2, suppose the style sample image y_e is input into the e4e model, which outputs the first hidden vector w_e. Then n random hidden vectors (z_1, z_2, …, z_n) are generated and input in turn into the mapping network of the first 2DGAN model, yielding n corresponding fourth hidden vectors (s_1, s_2, …, s_n). The first hidden vector w_e is mixed with each of the n fourth hidden vectors to obtain n mixed hidden vectors (w_1, w_2, …, w_n), and the n mixed hidden vectors are input in turn into the synthesis network of the first 2DGAN model, which outputs n generated images: y_e1', y_e2', …, y_en'.
Loss function values are then calculated between each of the n generated images and the style sample image y_e, and gradient updates are performed through back-propagation based on the calculated loss values, so as to adjust the network parameters of the first 2DGAN model and realize the optimization.
Optionally, the first hidden vector w_e and a fourth hidden vector s_i may be mixed according to the following formula:
w_i = α·w_e + (1 − α)·s_i, where α is a set constant coefficient.
The second loss function may be configured to constrain the image generated from the mixed hidden vector to differ minimally from the style sample image.
In this way, through the optimization training process, the first 2DGAN model can be adjusted to the style feature domain corresponding to the style sample image, i.e., the second 2DGAN model after optimization can sufficiently learn the style features presented in the style sample image.
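A minimal sketch of this fine-tuning loop is given below, continuing the names from the earlier sketches (`mapping`, `synthesis`, `w_e`, `style_image`); the use of LPIPS as the image-level loss, the value of α, and the choice to update only the synthesis network are illustrative assumptions, since the patent only requires the generated image to stay close to the style sample image.

```python
import torch
import lpips  # perceptual-similarity package; assumed available

alpha = 0.7                                  # assumed mixing coefficient
percep = lpips.LPIPS(net="vgg")              # second-loss term (assumption)
opt = torch.optim.Adam(synthesis.parameters(), lr=2e-3)

for step in range(300):                      # n optimization rounds
    z = torch.randn(1, 512)
    with torch.no_grad():
        s_i = mapping(z)                     # fourth hidden vector
    w_i = alpha * w_e + (1 - alpha) * s_i    # mixed hidden vector
    y_i = synthesis(w_i)                     # generated image y_ei'
    loss = percep(y_i, style_image).mean()   # stay close to the style sample image
    opt.zero_grad()
    loss.backward()
    opt.step()
```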
After the second 2DGAN model is obtained, for the given face input image, a corresponding first 2D stylized face image is generated through the second 2DGAN model; in short, the style features of the style sample image are migrated into the face input image via the second 2DGAN model.
Specifically, the GAN inversion model may be used to map the face input image into the latent space to obtain the second hidden vector corresponding to the face input image; the second hidden vector is then input into the second 2DGAN model, specifically into its synthesis network, to obtain the output first 2D stylized face image. The first 2D stylized face image is a 2D face image onto which the style features presented by the style sample image have been migrated; that is, the face input image has been stylized with those style features, hence the name 2D stylized face image.
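Continuing the sketch, the supervision image might be produced as follows; `encoder` is the hypothetical e4e module from above, and `synthesis` now denotes the fine-tuned (second) 2DGAN's synthesis network.

```python
# Invert the face input image, then synthesize with the fine-tuned 2DGAN
# to obtain the supervision image; names remain illustrative.
face_image = torch.randn(1, 3, 256, 256)  # stand-in for the face input image X
with torch.no_grad():
    w_2d = encoder(face_image)            # second hidden vector
    y_sup = synthesis(w_2d)               # first 2D stylized face image
```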
So far, model optimization and supervision information generation of the 2D branch are completed: the first 2DGAN model is optimized to learn a second 2DGAN model of the style characteristics in the style sample image, and the face input image is stylized based on the second 2DGAN model, so that a first 2D stylized face image serving as monitoring information is obtained. Thereafter, based on the supervisory information, the 3DGAN model in the 3D branch may be trained.
For the training of the 3DGAN model, the third hidden vector corresponding to the face input image and the camera view angle of the face input image is determined first; the third hidden vector and the camera view angle are then input into the 3DGAN model to be trained to obtain the output 3D stylized face image. A first loss function value for training the 3DGAN model is then determined based on the 3D stylized face image and the first 2D stylized face image serving as supervision information, so as to train the 3DGAN model, i.e., adjust its parameters, based on the first loss function value.
The third hidden vector can be obtained through an inversion process; optionally, it can be obtained by the GAN inversion model described above.
A 3DGAN model such as EG3D must take the camera view angle into account when generating 3D images, because the neural radiance field in the EG3D model must be rendered in conjunction with the camera view angle during volume rendering. Accordingly, in addition to the camera view angle, the input of the EG3D model includes a hidden vector for generating the image. Therefore, when training the 3DGAN model, a third hidden vector corresponding to the current face input image must first be generated.
Optionally, the third hidden vector can again be obtained through an inversion process, this time using the 3DGAN model itself. In addition, in order to control the details of the image generated by the 3DGAN model more precisely, a random noise vector for controlling the details of the generated image may be incorporated into the generation process of the third hidden vector.
Based on this, optionally, determining a third latent vector corresponding to the face input image and the camera angle of view of the face input image may be implemented as:
initializing a third hidden vector and a noise vector, and executing the following iteration process to obtain the third hidden vector when an iteration cutoff condition is met:
inputting a third hidden vector and a noise vector corresponding to the current iteration round and a camera view angle of the human face input image into the 3D generation network model to be trained so as to obtain a human face output image corresponding to the current iteration round;
and determining a third loss function value according to the face output image and the face input image corresponding to the current iteration turn, so as to train a 3DGAN model based on the third loss function value and update a third hidden vector and a noise vector.
For the sake of understanding, the above-mentioned process of obtaining the third hidden vector is illustrated with reference to fig. 3.
As shown in Fig. 3, the camera view angle (pose) corresponding to the face input image X may be determined by a set camera-parameter recognition algorithm, and two parameters to be optimized are set: a third hidden vector w_3d and a noise vector ns. Through multiple rounds of iterative training based on the 3DGAN model, w_3d and ns are continuously optimized, and the third hidden vector w_3d obtained when the iteration cutoff condition is met is used in the next stage of training of the 3DGAN model.
Specifically, the third hidden vector and the noise vector may be randomly initialized as w_3d0 and ns_0. In the first round of iterative training, w_3d0, ns_0, and the camera view angle (pose) are input into the 3DGAN model, which generates a 3D face output image Y1. A third loss function value is then calculated from the face output image Y1 and the face input image X, and gradient updates are performed through back-propagation based on this value to update the parameters of the 3DGAN model as well as the third hidden vector and the noise vector, say to w_3d1 and ns_1.
Then, in the second round of iterative training, w_3d1, ns_1, and the camera view angle (pose) are input into the 3DGAN model, which generates a face output image Y2; another third loss function value is calculated from Y2 and X, and back-propagation updates the 3DGAN parameters, the third hidden vector, and the noise vector, say to w_3d2 and ns_2. The next round of iterative training then proceeds with the updated vectors, and so on until the iteration cutoff condition is met. The iteration cutoff condition may be that the number of iterations reaches a set number, or that the third loss function value falls below a set threshold.
In an alternative embodiment, the third loss function may be defined as the Learned Perceptual Image Patch Similarity (LPIPS) loss, also referred to as perceptual loss, between the face output image and the face input image X.
In another alternative embodiment, in addition to the LPIPS loss, the third loss function may include two further losses: the L2 loss between the face output image and the face input image X, and the identity loss between the face output image and the face input image X. The three loss values are computed, and the result of a weighted sum with set weighting coefficients is taken as the final third loss function value.
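A minimal sketch of this inversion stage follows; `eg3d` is a hypothetical EG3D-style generator whose forward pass takes a latent, a noise vector, and a camera pose, `id_loss` stands in for a face-recognition identity loss, and the weighting coefficients are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import lpips

percep = lpips.LPIPS(net="vgg")
w_3d = torch.randn(1, 512, requires_grad=True)   # third hidden vector
ns = torch.randn(1, 512, requires_grad=True)     # noise vector
opt = torch.optim.Adam([w_3d, ns, *eg3d.parameters()], lr=1e-3)

lam_lpips, lam_l2, lam_id = 1.0, 0.1, 0.5        # assumed weighting coefficients

for step in range(500):                          # until the cutoff condition
    y = eg3d(w_3d, ns, pose)                     # face output image at the input pose
    loss = (lam_lpips * percep(y, face_image).mean()
            + lam_l2 * F.mse_loss(y, face_image)
            + lam_id * id_loss(y, face_image))   # hypothetical identity loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```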
Continuously optimizing the third hidden vector w_3d while the parameters of the 3DGAN model are updated constitutes, in effect, the first stage of training; after the optimized third hidden vector w_3d is obtained, the 3DGAN model is trained in the next stage.
At this point, the optimized third hidden vector w_3d and the camera view angle (pose) are input into the 3DGAN model, which finally outputs a 3D stylized face image. Based on this 3D stylized face image and the first 2D stylized face image serving as supervision information, the first loss function value is calculated so as to update the parameters of the 3DGAN model.
In an alternative embodiment, the first loss function may be defined as a Mean Square Error (MSE) loss of the first 2D stylized face image and the 3D stylized face image.
In another alternative embodiment, the first loss function may include two kinds of perceptual losses, namely, a first perceptual loss of the first 2D stylized face image and the 3D stylized face image, and a second perceptual loss of the 3D stylized face image and the style sample image, in addition to the MSE loss.
Based on this, the calculation of the first loss function value comprises: determining a mean square error loss value between the first 2D stylized face image and the 3D stylized face image; determining a first perceptual loss value between the first 2D stylized face image and the 3D stylized face image; determining a second perceptual loss value between the 3D stylized face image and the style sample image; and determining the first loss function value from the mean square error loss value, the first perceptual loss value, and the second perceptual loss value, for example by taking a weighted sum of the three loss values with set weighting coefficients as the first loss function value.
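As an illustration, the combined first loss might be computed as below, reusing `percep` from the earlier sketches; the weighting coefficients are assumptions.

```python
import torch.nn.functional as F

# Hedged sketch of the first loss: MSE plus two perceptual terms.
# y_3d: 3D-branch output; y_sup: 2D supervision image; style_image: style sample.
def first_loss(y_3d, y_sup, style_image, w_mse=1.0, w_p1=0.8, w_p2=0.3):
    mse = F.mse_loss(y_3d, y_sup)              # mean square error loss
    p1 = percep(y_3d, y_sup).mean()            # first perceptual loss
    p2 = percep(y_3d, style_image).mean()      # second perceptual loss
    return w_mse * mse + w_p1 * p1 + w_p2 * p2
```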
After the 3DGAN model has been trained through the above process, a 3D stylized face image at any target camera view angle can be generated for the given face input image based on the trained model. Specifically, after the target camera view angle is received, the third hidden vector and the target camera view angle may be input into the trained 3DGAN model to obtain the target 3D stylized face image at that view angle.
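Inference at a new view angle then reduces to a single forward pass (continuing the earlier sketch; names remain illustrative):

```python
# Render the stylized face at a user-supplied target camera pose
# with the trained 3D generator; `target_pose` is any new view angle.
with torch.no_grad():
    y_target = eg3d(w_3d, ns, target_pose)  # target 3D stylized face image
```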
In the scheme provided by the embodiments of the invention, a 3DGAN model is used so that the model naturally has the capability of capturing 3D information, and a 2DGAN model is introduced as a bridge between 2D stylized face images and 3D stylized face images: the 3DGAN model is trained following the output of the 2DGAN model (the 2D stylized face image generated by the 2DGAN model serves as supervision information), so that the corresponding 3D stylized face generator is successfully tuned and convergence is accelerated.
Fig. 4 is an application schematic diagram of the facial image stylized model training method provided in the embodiments of the invention. In Fig. 4, taking an online video conference scenario as an example, an online video conference is started on a user's terminal device. The user may set his or her face image in the conference APP in advance, i.e., the face input image X in the figure, and select the style sample image y_e shown in the figure from a style sample image library; a 3DGAN model corresponding to the style sample image and the face input image is then trained according to the training process described above.
Specifically, the hidden vector w_e corresponding to the style sample image y_e is determined through the e4e model. Style mixing is performed between w_e and a hidden vector s_i generated by the mapping network of the pre-trained 2DGAN model, and the mixed hidden vector w_i is input into the 2DGAN model to obtain a 2D stylized face image y_e'. A loss function value Loss1 is determined from the 2D stylized face image y_e' and the style sample image y_e, so as to optimize and train the 2DGAN model.
Then, for the face input image X, the hidden vector w_2d corresponding to X is determined through the e4e model and input into the optimized 2DGAN model, which outputs a 2D stylized face image g(w_2d).
Thereafter, the hidden vector w_3d corresponding to the face input image X and its camera view angle (pose = c) are determined, and w_3d and the camera view angle c are input into the 3DGAN model to obtain the 3D stylized face image g(w_3d, c) output by the 3DGAN model for view angle c. As can be seen in Fig. 4, g(w_3d, c) is a 3D image, and the corresponding 3D shape is illustrated alongside it. A loss function value Loss2 is calculated from the 2D stylized face image g(w_2d) and the 3D stylized face image g(w_3d, c) to train the 3DGAN model.
As shown in Fig. 4, after the 3DGAN model is trained, the user can input different camera view angles, such as the different yaw angles illustrated in the figure, and the trained 3DGAN model produces the corresponding 3D stylized face images at those view angles. In this way, a dynamically changing facial form can be presented in the video conference interface.
The facial image stylized model training method provided by the embodiments of the invention can be executed in the cloud, where a number of computing nodes (cloud servers) may be deployed, each with processing resources such as computation and storage. In the cloud, several computing nodes may be organized to provide a certain service; of course, one computing node may also provide one or more services. The cloud may provide a service by exposing a service interface that users call to use the corresponding service.
In the scheme provided by the embodiments of the invention, the cloud may expose a service interface for a set service (a facial image stylized model training service). A user calls the service interface through a terminal device to send a facial image stylized model training request to the cloud, the request comprising a style sample image and a face input image. The cloud determines the computing node that responds to the request and uses the processing resources of that node to perform the following steps:
optimizing the pre-trained first 2D generation network model according to the first hidden vector corresponding to the style sample image to obtain a second 2D generation network model;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image;
generating a 2D stylized face image serving as supervision information of a 3D generation network model to be trained according to the second hidden vector and the second 2D generation network model;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models;
and feeding back the trained 3D generation network model to the terminal equipment.
The above implementation process may refer to the related descriptions in the foregoing other embodiments, which are not described herein.
For ease of understanding, the description is illustrated with reference to Fig. 5. The user may invoke the facial image stylized model training service through the terminal device E1 illustrated in Fig. 5 to upload the 2D face input image and the style sample image. Service interfaces through which the user calls the service include a Software Development Kit (SDK), an Application Programming Interface (API), and the like; Fig. 5 illustrates the API case. In the cloud, as shown in the figure, suppose a service cluster E2, comprising at least one computing node, provides the facial image stylized model training service. After receiving the request, the service cluster E2 executes the steps described in the foregoing embodiments to obtain a trained 3D generation network model and feeds the trained model back to the terminal device E1.
Based on the received 3D generation network model, the terminal device E1 can accept any target camera view angle for the face input image and generate the corresponding 3D stylized face image at that view angle.
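For illustration only, a client-side invocation of such a service might look like the sketch below; the endpoint URL, field names, and response format are hypothetical assumptions, not part of the patent.

```python
import requests  # assumed HTTP client

# Hypothetical call to the cloud training service; the URL and JSON
# fields are illustrative only.
with open("face.jpg", "rb") as f, open("style.jpg", "rb") as s:
    resp = requests.post(
        "https://example.com/api/face-stylize/train",   # hypothetical endpoint
        files={"face_input_image": f, "style_sample_image": s},
    )
model_url = resp.json()["trained_model_url"]  # assumed response field
```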
The facial image stylized model training apparatus of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these means can each be constructed using commercially available hardware components and by performing the steps taught in this disclosure.
Fig. 6 is a schematic structural diagram of a facial image stylized model training apparatus provided in an embodiment of the present invention, and as shown in fig. 6, the apparatus includes: the system comprises an acquisition module 11, an optimization module 12 and a training module 13.
And an obtaining module 11, configured to obtain a 2D face input image and a style sample image.
And the optimization module 12 is configured to optimize the pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, where the first hidden vector reflects style features in the style sample image.
A training module 13, configured to: determine a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and the camera view angle of the face input image; generate, according to the second hidden vector and the second 2D generation network model, a first 2D stylized face image serving as supervision information of a 3D generation network model to be trained; generate a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image, and the 3D generation network model; and determine a first loss function value for training the 3D generation network model according to the first 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value. The first 2D generation network model and the 3D generation network model are both neural network models, the second hidden vector reflects the face features in the face input image, and the third hidden vector reflects the face features and pose features in the face input image.
Optionally, the optimization module 12 is configured to: input the style sample image into a generation network inversion model, so as to obtain the first hidden vector through the generation network inversion model.
Optionally, the training module 13 is configured to: input the face input image into a generation network inversion model, so as to obtain the second hidden vector through the generation network inversion model, where the generation network inversion model is a neural network model.
Optionally, the optimization module 12 is configured to: acquiring a fourth hidden vector for generating an image through the first 2D generation network model; and performing optimization training on the first 2D generation network model according to a mixed hidden vector of the first hidden vector and the fourth hidden vector corresponding to the style sample image to obtain a second 2D generation network model.
The optimization module 12 is configured to: inputting the mixed hidden vector into the first 2D generation network model to obtain a second 2D stylized face image output by the first 2D generation network model; and determining a second loss function value for optimizing the first 2D generation network model according to the second 2D stylized face image and the style example image, so as to perform optimization training on the first 2D generation network model according to the second loss function value.
Optionally, the training module 13 is configured to: initializing a third implicit vector and a noise vector, and executing the following iteration process to obtain the third implicit vector when an iteration cutoff condition is met: inputting a third hidden vector and a noise vector corresponding to the current iteration round and a camera view angle of the human face input image into the 3D generation network model to be trained to obtain a human face output image corresponding to the current iteration round; and determining a third loss function value according to the face output image and the face input image corresponding to the current iteration round, so as to train the 3D generation network model based on the third loss function value and update the third hidden vector and the noise vector.
Optionally, the training module 13 is configured to: determining a mean square error loss value of the first 2D stylized face image and the 3D stylized face image; determining a first perception loss value of the first 2D stylized face image and the 3D stylized face image; determining a second perceptual loss value for the 3D stylized face image and the style sample image; and determining the first loss function value according to the mean square error loss value, the first perception loss value and the second perception loss value.
Optionally, the apparatus further comprises: a generation module to receive a target camera perspective; inputting the third hidden vector and the target camera view angle into the trained 3D generation network model to obtain a target 3D stylized face image under the target camera view angle.
The apparatus shown in fig. 6 can perform the steps in the foregoing embodiments, and the detailed performing process and technical effects refer to the descriptions in the foregoing embodiments, which are not described herein again.
In one possible design, the structure of the training apparatus for facial image stylized model as shown in fig. 6 may be implemented as an electronic device. As shown in fig. 7, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. Wherein the memory 22 has stored thereon executable code which, when executed by the processor 21, causes the processor 21 to implement at least the facial image stylized model training method as provided in the preceding embodiments.
In an optional embodiment, the electronic device may be an augmented reality device. The augmented reality device may perform the following method:
displaying a 2D face input image and a style sample image;
optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image;
generating a 2D stylized face image serving as supervision information of a 3D generation network model to be trained according to the second hidden vector and the second 2D generation network model;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models.
Optionally, the augmented reality device may further: and displaying the 2D stylized face image and/or the 3D stylized face image.
Optionally, the method further comprises:
receiving a target camera view;
determining a fifth latent vector corresponding to the face input image and the target camera perspective;
inputting the fifth hidden vector and the target camera view angle into the trained 3D generation network model to obtain a target 3D stylized face image under the target camera view angle;
and displaying the target 3D stylized face image.
In practical applications, such devices may include, for example, virtual reality devices and augmented reality devices. In some scenarios, an application in a virtual reality device provides a service for generating a virtual person corresponding to the application: multiple style sample images are provided in the application, the user selects a preferred style sample image, the user's face image is collected, the style features of the selected style sample image are migrated onto the user's face image to generate a 3D stylized face image, and the generated 3D stylized face image is displayed on the application's interactive interface. For example, if the application is a game program, the user can generate a 3D stylized face image of himself or herself to present in the 3D game scene.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the facial image stylized model training method as provided in the foregoing embodiment.
The above-described apparatus embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by adding a necessary general hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a computer program product on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (13)
1. A facial image stylized model training method is characterized by comprising the following steps:
acquiring a 2D face input image and a style sample image;
optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, wherein the first hidden vector reflects style characteristics in the style sample image;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image, wherein the second hidden vector reflects face features in the face input image, and the third hidden vector reflects face features and pose features in the face input image;
generating a first 2D stylized face image serving as supervision information of a 3D generation network model to be trained according to the second hidden vector and the second 2D generation network model;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the first 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models.
2. The method of claim 1, wherein optimizing the pre-trained first 2D generation network model according to the first hidden vector corresponding to the style example image to obtain a second 2D generation network model comprises:
acquiring a fourth hidden vector for generating an image through the first 2D generation network model;
and performing optimization training on the first 2D generation network model according to a mixed hidden vector of the first hidden vector and the fourth hidden vector corresponding to the style sample image to obtain the second 2D generation network model.
3. The method of claim 2, wherein the performing optimization training on the first 2D generation network model according to the mixed hidden vector of the first hidden vector and the fourth hidden vector corresponding to the style sample image comprises:
inputting the mixed hidden vector into the first 2D generation network model to obtain a second 2D stylized face image output by the first 2D generation network model;
and determining a second loss function value for optimizing the first 2D generation network model according to the second 2D stylized face image and the style sample image, so as to perform optimization training on the first 2D generation network model according to the second loss function value.
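For illustration only: a sketch of the latent-mixing fine-tuning in claims 2 and 3, assuming a StyleGAN-like W+ code of shape (num_layers, dim); the layer split point and the sampler sample_w are assumptions.

```python
def mix_latents(w_style, w_random, mix_from=7):
    """Mixed hidden vector: coarse layers from the fourth (sampled) code,
    fine layers from the first (style) code. A W+ layout is assumed."""
    w_mixed = w_random.clone()
    w_mixed[mix_from:] = w_style[mix_from:]
    return w_mixed

def finetune_step(g2d, optimizer, w_style, style_image, sample_w, perceptual_loss):
    """Render the mixed code and pull the second 2D stylized face image
    toward the style sample image (the second loss function value)."""
    w_mixed = mix_latents(w_style, sample_w())  # sample_w: assumed latent sampler
    out = g2d(w_mixed)                          # second 2D stylized face image
    loss = perceptual_loss(out, style_image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```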
4. The method of claim 1, wherein determining a third hidden vector corresponding to the face input image and a camera perspective of the face input image comprises:
initializing the third hidden vector and a noise vector, and executing the following iteration process until an iteration cutoff condition is met, to obtain the third hidden vector:
inputting the third hidden vector and the noise vector corresponding to the current iteration round and the camera view angle of the face input image into the 3D generation network model to be trained, so as to obtain a face output image corresponding to the current iteration round;
and determining a third loss function value according to the face output image and the face input image corresponding to the current iteration round, so as to train the 3D generation network model based on the third loss function value and update the third hidden vector and the noise vector.
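For illustration only: one way to realize the claim-4 iteration, jointly optimizing the third hidden vector and the noise vector by gradient descent. The 512-dimensional codes and the g3d signature are assumptions; the claim additionally updates the 3D model's own weights, which a second optimizer over its parameters would cover.

```python
import torch
import torch.nn.functional as F

def fit_third_hidden_vector(g3d, face_image, cam_pose, steps=500, lr=0.01):
    """Fit the third hidden vector and noise vector so g3d reproduces the
    input face at its camera view angle (claim-4 sketch, assumed interfaces)."""
    w = torch.zeros(1, 512, requires_grad=True)      # third hidden vector (init)
    noise = torch.randn(1, 512, requires_grad=True)  # noise vector (init)
    opt = torch.optim.Adam([w, noise], lr=lr)
    for _ in range(steps):                           # iteration cutoff condition
        out = g3d(w, cam_pose, noise=noise)          # face output image, this round
        loss = F.mse_loss(out, face_image)           # third loss function value
        opt.zero_grad()
        loss.backward()
        opt.step()                                   # updates w and noise
    return w.detach(), noise.detach()
```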
5. The method of any of claims 1 to 4, wherein determining a first loss function value for training the 3D generation network model from the first 2D stylized face image and the 3D stylized face image comprises:
determining a mean square error loss value of the first 2D stylized face image and the 3D stylized face image;
determining a first perceptual loss value of the first 2D stylized face image and the 3D stylized face image;
determining a second perception loss value of the 3D stylized face image and the style sample image;
and determining the first loss function value according to the mean square error loss value, the first perception loss value and the second perception loss value.
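For illustration only: the claim-5 composite written as a weighted sum; the claim itself does not state how the three terms are combined, so the weights here are assumptions.

```python
import torch.nn.functional as F

def first_loss_value(pred_3d, target_2d, style_image, perceptual_loss,
                     w_mse=1.0, w_p1=1.0, w_p2=1.0):
    """Claim-5 composite loss over the 3D output, the 2D supervision image,
    and the style sample image (weighted sum is an assumption)."""
    mse = F.mse_loss(pred_3d, target_2d)        # mean square error loss value
    p1 = perceptual_loss(pred_3d, target_2d)    # first perceptual loss value
    p2 = perceptual_loss(pred_3d, style_image)  # second perceptual loss value
    return w_mse * mse + w_p1 * p1 + w_p2 * p2
```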
6. The method of claim 1, further comprising:
receiving a target camera view angle;
inputting the third hidden vector and the target camera view angle into the trained 3D generation network model to obtain a target 3D stylized face image under the target camera view angle.
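For illustration only: claim-6 inference reduces to a single forward pass with the fixed third hidden vector and the new view angle; the call signature is an assumed interface.

```python
import torch

@torch.no_grad()
def render_target_view(g3d_trained, w_3d, target_cam_pose):
    # Novel-view synthesis: same identity/style code, new camera view angle.
    return g3d_trained(w_3d, target_cam_pose)
```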
7. The method of claim 1, further comprising:
and respectively inputting the style sample image and the face input image to generate a network inversion model, so as to respectively obtain the first hidden vector and the second hidden vector through the generated network inversion model, wherein the generated network inversion model is a neural network model.
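For illustration only: claim 7's generative-network-inversion model as a feed-forward encoder (a pSp/e4e-style network would fit); the encoder itself is an assumption.

```python
import torch

@torch.no_grad()
def invert_images(inversion_net, style_image, face_image):
    """Map each image to its latent code via an assumed inversion encoder."""
    w_style = inversion_net(style_image)  # first hidden vector (style features)
    w_face = inversion_net(face_image)    # second hidden vector (face features)
    return w_style, w_face
```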
8. A facial image stylized model training method is characterized by comprising the following steps:
receiving a request triggered by a terminal device by calling a setting service, wherein the request comprises a 2D face input image and a style sample image;
executing the following steps by utilizing the processing resource corresponding to the setting service:
optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, wherein the first hidden vector reflects style characteristics in the style sample image;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image, wherein the second hidden vector reflects face features in the face input image, and the third hidden vector reflects face features and pose features in the face input image;
generating, according to the second hidden vector and the second 2D generation network model, a 2D stylized face image serving as supervision information for a 3D generation network model to be trained;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models;
and feeding back the trained 3D generation network model to the terminal device.
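For illustration only: the claim-8 service flow as a plain handler. The request schema, pipeline entry point, and state_dict return are assumptions, not the patent's API.

```python
def handle_stylize_request(request, train_pipeline):
    """Run the training pipeline on the service's processing resources and
    feed the trained 3D generator back to the terminal device (sketch)."""
    face_image = request["face_input_image"]      # uploaded by the terminal device
    style_image = request["style_sample_image"]
    g3d_trained = train_pipeline(face_image, style_image)  # claims 1-5 steps
    return {"model_state": g3d_trained.state_dict()}       # assumed torch module
```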
9. An electronic device, comprising: a memory, a processor, and a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the facial image stylized model training method of any one of claims 1 to 7.
10. A non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of a cloud server, causes the processor to perform the facial image stylized model training method of any one of claims 1 to 7.
11. A facial image stylized model training method is applied to an augmented reality device, and comprises the following steps:
displaying a 2D face input image and a style sample image;
optimizing a pre-trained first 2D generation network model according to a first hidden vector corresponding to the style sample image to obtain a second 2D generation network model, wherein the first hidden vector reflects style characteristics in the style sample image;
determining a second hidden vector corresponding to the face input image and a third hidden vector corresponding to the face input image and a camera view angle of the face input image, wherein the second hidden vector reflects face features in the face input image, and the third hidden vector reflects face features and pose features in the face input image;
generating, according to the second hidden vector and the second 2D generation network model, a 2D stylized face image serving as supervision information for a 3D generation network model to be trained;
generating a 3D stylized face image according to the third hidden vector, the camera view angle of the face input image and the 3D generation network model;
determining a first loss function value for training the 3D generation network model according to the 2D stylized face image and the 3D stylized face image, so as to train the 3D generation network model based on the first loss function value; wherein the first 2D generation network model and the 3D generation network model are both neural network models.
12. The method of claim 11, further comprising:
and displaying the 2D stylized face image and/or the 3D stylized face image.
13. The method of claim 11, further comprising:
receiving a target camera view angle;
determining a fifth hidden vector corresponding to the face input image and the target camera view angle;
inputting the fifth hidden vector and the target camera view angle into the trained 3D generation network model to obtain a target 3D stylized face image under the target camera view angle;
and displaying the target 3D stylized face image.
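For illustration only: the claim-13 flow on the augmented reality device, where the face is re-inverted at the target view angle to obtain the fifth hidden vector before rendering; invert_at_view and display are assumed interfaces.

```python
import torch

@torch.no_grad()
def ar_target_view(g3d_trained, invert_at_view, face_image, target_cam_pose, display):
    """Render and show the target-view stylized face on the AR device (sketch)."""
    w5 = invert_at_view(face_image, target_cam_pose)  # fifth hidden vector
    frame = g3d_trained(w5, target_cam_pose)          # target 3D stylized face image
    display(frame)                                    # shown on the AR device
    return frame
```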
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211367061.2A CN115908109A (en) | 2022-11-02 | 2022-11-02 | Facial image stylized model training method, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211367061.2A CN115908109A (en) | 2022-11-02 | 2022-11-02 | Facial image stylized model training method, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115908109A (en) | 2023-04-04 |
Family
ID=86481591
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211367061.2A (Pending, CN115908109A) | Facial image stylized model training method, equipment and storage medium | 2022-11-02 | 2022-11-02 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115908109A (en) |
Cited By (7)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN116580212A (en) * | 2023-05-16 | 2023-08-11 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Image generation method, training method, device and equipment of image generation model
CN116580212B (en) * | 2023-05-16 | 2024-02-06 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Image generation method, training method, device and equipment of image generation model
CN116402916A (en) * | 2023-06-08 | 2023-07-07 | Beijing Ruilai Zhihui Technology Co., Ltd. | Face image restoration method and device, computer equipment and storage medium
CN116402916B (en) * | 2023-06-08 | 2023-09-05 | Beijing Ruilai Zhihui Technology Co., Ltd. | Face image restoration method and device, computer equipment and storage medium
CN116418961A (en) * | 2023-06-09 | 2023-07-11 | Shenzhen Zhenxiang Technology Co., Ltd. | Light field display method and system based on three-dimensional scene stylization
CN116418961B (en) * | 2023-06-09 | 2023-08-22 | Shenzhen Zhenxiang Technology Co., Ltd. | Light field display method and system based on three-dimensional scene stylization
CN116862759A (en) * | 2023-06-19 | 2023-10-10 | Harbin Institute of Technology (Shenzhen) (Harbin Institute of Technology Shenzhen Institute of Science and Technology Innovation) | Personalized portrait generation system and method based on generation countermeasure network
Similar Documents

Publication | Title
---|---
CN115908109A (en) | Facial image stylized model training method, equipment and storage medium
WO2018050001A1 (en) | Method and device for generating animation data
CN113272870A (en) | System and method for realistic real-time portrait animation
US11514638B2 (en) | 3D asset generation from 2D images
CN106780681B (en) | Role action generation method and device
KR20220017900A (en) | Single image-based real-time body animation
CN111464834A (en) | Video frame processing method and device, computing equipment and storage medium
CN111476871A (en) | Method and apparatus for generating video
CN113012282A (en) | Three-dimensional human body reconstruction method, device, equipment and storage medium
CN115690382A (en) | Training method of deep learning model, and method and device for generating panorama
CN115965840A (en) | Image style migration and model training method, device, equipment and medium
CN112541570A (en) | Multi-model training method and device, electronic equipment and storage medium
CN116524151A (en) | Method, apparatus and computer program product for generating an avatar
CN114266693A (en) | Image processing method, model generation method and equipment
CN114373033B (en) | Image processing method, apparatus, device, storage medium, and computer program
CN114373034B (en) | Image processing method, apparatus, device, storage medium, and computer program
CN118115642A (en) | Three-dimensional digital person generation method, three-dimensional digital person generation device, electronic device, storage medium, and program product
CN116188251A (en) | Model construction method, virtual image generation method, device, equipment and medium
US11908070B2 (en) | Dynamic three-dimensional imaging method
CN116132653A (en) | Processing method and device of three-dimensional model, storage medium and computer equipment
CN117422797A (en) | Expression feature extraction method, device, equipment and computer readable storage medium
CN118314254B (en) | Dynamic 3D target modeling and dynamic 3D object modeling method and device
CN116977417B (en) | Pose estimation method and device, electronic equipment and storage medium
CN113223128A (en) | Method and apparatus for generating image
WO2024152697A1 (en) | Data processing method and apparatus, computer device, and readable storage medium
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination