Abstract
Face reshaping aims to adjust the shape of a face in a portrait image to make it more aesthetically pleasing, and it has many potential applications. Existing methods 1) operate on pre-defined facial landmarks, leading to artifacts and distortions due to the limited number of landmarks; 2) synthesize new faces based on segmentation masks or sketches, causing the generated faces to look unsatisfactory because skin details are lost and hair and blurred backgrounds are difficult to handle; or 3) project the positions of deformed feature points from a 3D face model onto the 2D image, making the results unrealistic because of the misalignment between feature points. In this paper, we propose a novel method named face shape transfer (FST) via semantic warping, which can transfer both the overall face and individual components (e.g., eyes, nose, and mouth) of a reference image to the source image. To achieve controllability at the component level, we introduce five encoding networks, which are designed to learn feature embeddings specific to different face components. To effectively exploit the features obtained from semantic parsing maps at different scales, we directly connect all layers within the global dense network; this direct connection facilitates maximum information flow between layers and efficiently utilizes semantic parsing information at diverse scales. To avoid deformation artifacts, we introduce a spatial transformer network, allowing the network to handle different types of semantic warping effectively. To facilitate extensive evaluation, we construct a large-scale high-resolution face dataset, which contains 14,000 images with a resolution of 1024 × 1024. The superior performance of our method is demonstrated by qualitative and quantitative experiments on the benchmark dataset.
1 Introduction
Facial beautification plays an important role in our social lives. People with beautiful faces have many advantages in social activities such as dating and voting. As an important task of facial beautification, face reshaping aims to beautify the shapes of portrait faces in images and allows users to customize the beautification of individual facial features. However, in practice face reshaping is mostly performed manually with image processing software such as Adobe Photoshop, which is laborious and time-consuming. Generating visually pleasing results with proper shape distortions usually requires professional artistic skills and subjective aesthetics, which is challenging for ordinary users without any reference information. Therefore, it is of great value to propose a reference-based method that automatically generates a reasonable shape transformation and allows users to manipulate the results flexibly.
Existing face-reshaping methods can be roughly classified into two groups: 2D-image-based methods and 3D-face-model-based methods. 2D-image-based methods, such as image morphing, derive from image-processing techniques and image analyses [1, 2]. For example, numerous works [3, 4] have presented image-morphing-based methods to reshape faces or clone facial expressions. Because such methods only edit facial landmarks to achieve different operations in the 2D image, they can only deal with frontal faces, for which the landmarks are easy to obtain. On the other hand, 3D-model-based methods can introduce more information than 2D-image-based methods. For example, numerous works [5–7] utilize 3D face shapes estimated from 2D images and edit the face on the 3D model. However, the flexibility of 3D face models is limited because their parameters exert global control and cannot manipulate individual components [8].
To overcome the aforementioned problems, we propose a novel method named face shape transfer (FST) via semantic warping, which is capable of transferring both the overall face and individual components (e.g., eyes, nose, and mouth) of the reference image to the source image while preserving the facial structure and personal identity. As shown in Fig. 1, we utilize the cycle consistency strategy [9] and an encoder-decoder network [10] to model the shape transformation. Additionally, to enable component-level controllability, we introduce five encoding networks to learn the feature embeddings for five face components (i.e., the left eye, right eye, nose, mouth, and skin) from the semantic parsing results, which aim to preserve the original structure of each component to the greatest extent. Then, to efficiently utilize the features from semantic parsing maps at different scales, we adopt an intuitive strategy of directly connecting all layers in the global dense network, ensuring maximum information flow between layers. Finally, we introduce a spatial transformer network to allow flexible warping operations, together with several loss functions that prevent ghosting artifacts and obvious distortions.
The main contributions of this paper are as follows:
1) We propose a novel method named face shape transfer via semantic warping, which is capable of transferring both the overall face and individual components (e.g., eyes, nose, and mouth) of the reference image to the source image without the intermediate-representation limitations (e.g., 3D morphable models and pre-defined landmarks) of existing methods.
2) We introduce a novel spatial transformer network with two innovative loss functions: coordinate-based reconstruction loss and facial component loss. These allow flexible warping operations and smoother translation of all pixels in the same semantic region.
3) We contribute a large-scale and high-resolution face dataset. Both qualitative and quantitative experiments are performed on our dataset and another benchmark dataset to demonstrate the superiority of our method over other state-of-the-art methods.
2 Related work
2D-based method. Traditional 2D-image-based methods compute transformation distances from the reference face shape to that of the source based on facial landmarks. Since large-scale deformation is prone to distortion, some methods [11–13] build a facial image database of the source person to retrieve the expression most similar to the reference as the basis of deformation, and then warp and stitch existing patches together. Although these methods succeed in face mimicking for a specific source person, collecting and pre-processing such a large dataset for each source person is expensive in practice. Recently, numerous methods [14–19] based on generative adversarial networks (GANs) [20] have been developed for face reshaping. However, most GAN-based methods require large image sets for training, and the results are generated based on similar examples, which largely limits diversity and controllability. Moreover, the personal identities of source images cannot be preserved, which is the main difference between our method and other GAN models.
3D-based method. Most 3D-based face reshaping methods reconstruct 3D face models from 2D images and then apply 3D model reshaping methods [21, 22]. Although the development of statistical shape models [23–25] and example-based models [26, 27] has matured modeling technology, 3D face reconstruction under a wide range of poses and expressions remains a challenging ill-posed problem. These methods require a 3D morphable model for shape transformation simulation, but this process is time-consuming and costly, limiting their application.
Local editing method. Local editing methods [28–32] address local editing (e.g., of the nose or background), as opposed to most GAN-based image editing methods, which modify the global appearance [33–35]. For example, Editing in Style [28] tries to identify the contribution of each channel in the style vectors to specific parts. Structured noise [29] replaces the learned constant of StyleGAN with an input tensor that combines local and global codes. Meanwhile, GANs are widely leveraged to learn a mapping from a reference in the source domain to the target domain. Specifically for local editing, the references often take the form of semantic masks [30, 36] or hand-drawn sketches [19, 37]. In the context of semantic-guided facial image synthesis, SPADE [36] leverages semantic information to modify the image decoder for better visual fidelity and texture-semantic alignment. SEAN [30] encodes real images into per-region style codes and manipulates them, but it requires pairs of images and segmentation masks for training. Recently, SofGAN [38] has been presented to use semantic volumes for 3D editable image synthesis. However, the interpretation of 3D geometry is still lacking, and a considerable number of semantically labeled 3D scans are required for training semantic rendering. In addition, there is no mechanism for preserving view consistency in the synthesized textures.
3 Method
3.1 Overall framework
Given a source image \(\boldsymbol{I}_{\mathrm{src}} \in \mathbb{R}^{3 \times H \times W}\) and a reference image \(\boldsymbol{I}_{\mathrm{ref}} \in \mathbb{R}^{3 \times H \times W}\), where W and H are the width and height of the image, FST aims to transfer the face shape of the reference image to the source image. As illustrated in Fig. 1, the inputs to FST are the semantic label masks of the source and reference images, \(\boldsymbol{P}_{\mathrm{src}} \in \mathbb{R}^{C \times H \times W}\) and \(\boldsymbol{P}_{\mathrm{ref}} \in \mathbb{R}^{C \times H \times W}\), obtained using the face parsing network [39], where C is the number of face components (e.g., eyes, nose, mouth, etc.). First, the local embedding network is used to learn embedding features from the five face components. Then, the features from each component are fed into the global dense network, which ensures maximum information flow between layers in the network. Finally, the spatial transformer network performs reshaping operations on \(\boldsymbol{P}_{\mathrm{src}}\) according to the decoding results to obtain the final result \(\boldsymbol{P}_{\mathrm{res}}\); the same warping can also be applied to \(\boldsymbol{I}_{\mathrm{src}}\) to obtain \(\boldsymbol{I}_{\mathrm{res}}\).
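The following is a minimal, self-contained sketch of this data flow under stated assumptions: the module names, layer widths, and the coarse 64 × 64 correspondence grid are illustrative stand-ins rather than the actual architecture.

```python
# Toy sketch of the FST forward pass (PyTorch). All module names and layer sizes
# are illustrative assumptions; only the overall data flow follows the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFST(nn.Module):
    def __init__(self, n_classes=19, code_dim=128, n_components=5):
        super().__init__()
        # one small local encoder per face component (left eye, right eye, nose, mouth, skin)
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Conv2d(n_classes, 32, 3, 2, 1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(32, code_dim))
            for _ in range(n_components)])
        # stand-in for the global dense network: maps the concatenated source and
        # reference codes to a coarse 2-channel correspondence grid in [-1, 1]
        self.decoder = nn.Sequential(
            nn.Linear(2 * n_components * code_dim, 2 * 64 * 64), nn.Tanh())

    def encode(self, component_patches):
        # component_patches: list of per-component crops of a parsing map
        return torch.cat([enc(p) for enc, p in zip(self.encoders, component_patches)], dim=1)

    def forward(self, I_src, patches_src, patches_ref):
        z = torch.cat([self.encode(patches_src), self.encode(patches_ref)], dim=1)
        grid = self.decoder(z).view(-1, 64, 64, 2)                 # coarse sampling grid
        grid = F.interpolate(grid.permute(0, 3, 1, 2), size=I_src.shape[-2:],
                             mode='bilinear', align_corners=True).permute(0, 2, 3, 1)
        return F.grid_sample(I_src, grid, mode='bilinear', align_corners=True)

model = ToyFST()
I_src = torch.randn(1, 3, 256, 256)
patches = [torch.randn(1, 19, 64, 32) for _ in range(5)]           # dummy component crops
I_res = model(I_src, patches, patches)                             # (1, 3, 256, 256)
```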
3.2 Local embedding network
Since a global network is quite limited in learning and recovering all local details of each instance [40], we design a separate face component encoding strategy to preserve local details. To this end, we segment the foreground face image into five components according to the face mask, which efficiently avoids interference from other face components when dealing with an individual component transformation. For each face component, we use a corresponding auto-encoder network to learn its original structure, which preserves the local relations and the global shape of the input data during embedding. To better balance the accuracy and efficiency of the separate encoding, the input size of each of the five components is determined by the maximum size of that component. With separate auto-encoder networks, we can conveniently change facial components in the encoding results and recombine different components from different faces. Following pix2pixHD [41], which trains an auto-encoder network to obtain a feature vector for each instance in the image, we add a component-wise average pooling layer to the output of the encoder so that the computed feature is averaged over the face component and can accommodate different component shapes.
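A minimal sketch of such a component-wise average pooling layer is given below, assuming a pix2pixHD-style interface in which an integer component map accompanies the encoder features; the function name is ours.

```python
# Component-wise average pooling: average encoder features over each component's
# pixels and broadcast the mean back to those pixels (illustrative sketch).
import torch

def component_average_pool(features, component_map):
    """features: (B, C, H, W) encoder output; component_map: (B, 1, H, W) integer labels."""
    pooled = torch.zeros_like(features)
    for label in component_map.unique():
        mask = (component_map == label).float()                    # (B, 1, H, W)
        area = mask.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean = (features * mask).sum(dim=(2, 3), keepdim=True) / area
        pooled = pooled + mean * mask                              # constant feature per region
    return pooled
```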
To verify whether the shape embedding features could capture meaningful facial structure information, we first apply the mean shift clustering method [42] to group face component shapes and then apply the t-SNE [43] scheme for visualization. Figure 2 demonstrates that face components within the same cluster share a similar facial structure while neighboring clusters are similar in certain semantic parts.
3.3 Global dense network
To provide more information about different instances of the same object category in the source mask \(\boldsymbol{P}_{\mathrm{src}}\) and reference mask \(\boldsymbol{P}_{\mathrm{ref}}\), we first concatenate the latent encoding vectors of the source embedding \(\boldsymbol{z}_{\mathrm{src}}\) and reference embedding \(\boldsymbol{z}_{\mathrm{ref}}\), which yields an object representation that can cope with the different components. Then, to efficiently utilize features at different scales, all layers in the dense blocks and dense transition layers are connected directly, ensuring maximum information flow between layers in the network. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers. During decoding, we combine features by concatenating them before passing them into subsequent layers, which produces a discriminative and appropriate descriptor for the affine transformation process.
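As a reference point, a standard DenseNet-style block with this direct connectivity looks roughly as follows; the growth rate and number of layers are illustrative assumptions.

```python
# Minimal dense block: every layer receives the concatenation of all earlier
# feature maps and forwards its own output to all later layers.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1)))
            channels += growth_rate          # the next layer sees all previous feature maps

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))    # direct connection to earlier layers
            features.append(out)
        return torch.cat(features, dim=1)              # pass everything to the next stage
```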
3.4 Spatial transformer network
The original spatial transformer network [44] was developed to improve object recognition performance by reducing geometric variation in the input. However, low-dimensional affine transformations [44] and homography transformations [45] fail to meet the requirement of dense and complex deformations. Therefore, we introduce a spatial transformer network that predicts warping parameters to enable shape transformation on \(\boldsymbol{I}_{\mathrm{src}}\).
Based on the decoded feature vectors of the face components, the spatial transformer network predicts dense correspondences denoted by a matrix \(\boldsymbol{C} \in \mathbb{R}^{2 \times H \times W}\). Specifically, \({\textit{C}}_{1, i, j}\) and \({\textit{C}}_{2, i, j}\) indicate the target position to which each pixel of the source image is warped. To this end, the feature vectors are fed into the localization network to obtain the transformation parameters θ for the subsequent calculation. During training, θ is updated by the facial component loss to reach the expected affine transformation matrix. Under the updated affine transformation matrix, we can easily obtain the output feature maps and the corresponding pixel values. After that, the parameterized sampling grid is prepared to obtain the coordinate relations between \(\boldsymbol{I}_{\mathrm{src}}\) and \(\boldsymbol{I}_{\mathrm{res}}\) via pixel matching. Finally, differentiable bilinear sampling operations are carried out on the source image to obtain the final results according to these coordinate relations, which cannot be derived from the earlier spatial transformer networks [44, 45].
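A sketch of the final warping step is given below, assuming \(\boldsymbol{C}\) stores target positions in pixel units; normalizing to the \([-1, 1]\) range expected by grid_sample is our assumption about the implementation detail.

```python
# Differentiable bilinear sampling from a dense correspondence map (illustrative).
import torch
import torch.nn.functional as F

def warp_with_correspondence(I_src, C):
    """I_src: (B, 3, H, W) source image; C: (B, 2, H, W), C[:, 0] = target x, C[:, 1] = target y."""
    B, _, H, W = I_src.shape
    # convert pixel coordinates to the [-1, 1] range expected by grid_sample
    x = 2.0 * C[:, 0] / (W - 1) - 1.0
    y = 2.0 * C[:, 1] / (H - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(I_src, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)
```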
3.5 Loss functions
To make \(\boldsymbol{P}_{\mathrm{res}}\) similar to \(\boldsymbol{P}_{\mathrm{ref}}\), our objective contains four main terms: a reconstruction loss for preserving the semantic information during shape transformation, an adversarial loss for preserving the local structure, a cycle-consistency loss for reducing unreasonable operations, and a facial component loss for reducing the artifacts inside each facial part; a spatial coordinate loss, described at the end of this subsection, is further added when computing the cycle consistency.
Reconstruction loss. We compute the widely-used \(\mathcal{L}_{1}\) loss [46] and perceptual loss [47] as our reconstruction loss \(\mathcal{L}_{\mathrm{rec}}^{\mathrm{p}}\) to preserve the global semantic information, which is defined as follows:
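The displayed equation is not reproduced here; a plausible form of Eq. (1), reconstructed from the surrounding description (an \(\mathcal{L}_{1}\) term plus a VGG-based perceptual term comparing \(\boldsymbol{P}_{\mathrm{res}}\) with \(\boldsymbol{P}_{\mathrm{ref}}\)), is

\[ \mathcal{L}_{\mathrm{rec}}^{\mathrm{p}} = \lambda _{\mathrm{l1}} \left \| \boldsymbol{P}_{\mathrm{res}} - \boldsymbol{P}_{\mathrm{ref}} \right \|_{1} + \lambda _{\mathrm{per}} \left \| \phi (\boldsymbol{P}_{\mathrm{res}}) - \phi (\boldsymbol{P}_{\mathrm{ref}}) \right \|_{1}, \]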
where ϕ is a pre-trained VGG-19 network [48], and \(\lambda _{\mathrm{l1}}\) and \(\lambda _{\mathrm{per}}\) denote the loss weights of the \(\mathcal{L}_{1}\) loss and the perceptual loss, respectively. However, the above reconstruction loss does not consider the position variance of small facial components. For example, it can easily capture the difference between skin regions but fails to capture changes in the left eye. Following Chu et al. [49], we compute the distance between the center points of the same component in \(\boldsymbol{P}_{\mathrm{res}}\) and \(\boldsymbol{P}_{\mathrm{ref}}\). First, we calculate the average coordinate \(\left (x^{c}, y^{c}\right )\) of each face component c and regard it as its central location as follows:
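The displayed equation is omitted here; a plausible form, with \(\Omega _{c}\) (our notation) denoting the set of pixels labeled as component c, is

\[ x^{c} = \frac{1}{|\Omega _{c}|} \sum_{(x, y) \in \Omega _{c}} x, \qquad y^{c} = \frac{1}{|\Omega _{c}|} \sum_{(x, y) \in \Omega _{c}} y, \qquad \Omega _{c} = \left \{ (x, y) \mid \boldsymbol{P}_{x, y} = c \right \}, \]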
where P indicates the parsing map. Then, the location-based reconstruction loss, which involves the ten components from the two parsing maps, is
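The displayed equation is omitted here; a plausible form, comparing the central locations of the five components in the result and the reference, is

\[ \mathcal{L}_{\mathrm{rec}}^{\mathrm{l}} = \sum_{c=1}^{5} \lambda _{\mathrm{l}} \left ( \left | x_{\mathrm{res}}^{c} - x_{\mathrm{ref}}^{c} \right | + \left | y_{\mathrm{res}}^{c} - y_{\mathrm{ref}}^{c} \right | \right ), \]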
where \(\lambda _{\mathrm{l}}\) is equal to the pixel ratio of each face component in the whole image, \(x_{\mathrm{res}}^{c}\) and \(y_{\mathrm{res}}^{c}\) indicate the average coordinates in the final result, \(x_{\mathrm{ref}}^{c}\) and \(y_{\mathrm{ref}}^{c}\) indicate the average coordinates in the reference input, and c indicates the component index. Finally, the full reconstruction loss is
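The displayed equation is omitted here; consistent with the weights described below, it takes the form

\[ \mathcal{L}_{\mathrm{r}} = \lambda _{\mathrm{p}} \mathcal{L}_{\mathrm{rec}}^{\mathrm{p}} + \lambda _{\mathrm{l}} \mathcal{L}_{\mathrm{rec}}^{\mathrm{l}}, \]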
where \(\lambda _{\mathrm{p}}\) and \(\lambda _{\mathrm{l}}\) are weights for the loss terms.
Adversarial loss. After preserving the global structure, we use an adversarial loss to address the local structure. In this work, we derive a similar function, based on StyleGAN [50], from the face parsing result. Specifically, we use adversarial learning to push the distribution of \(\boldsymbol{P}_{\mathrm{res}}\) toward that of the reference \(\boldsymbol{P}_{\mathrm{ref}}\) and calculate the adversarial loss \(\mathcal{L}_{\mathrm{a}}\). The training strategy is the same as that of the WGAN [51] model.
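For concreteness, since the training strategy follows WGAN, a standard form of the generator term would be

\[ \mathcal{L}_{\mathrm{a}} = -\mathbb{E}\left [ D\left ( \boldsymbol{P}_{\mathrm{res}} \right ) \right ], \]

with the critic D trained to maximize \(\mathbb{E}[D(\boldsymbol{P}_{\mathrm{ref}})] - \mathbb{E}[D(\boldsymbol{P}_{\mathrm{res}})]\); the exact formulation used in the paper may differ.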
Cycle consistency loss. Following Chu et al. [49], we compute the cycle consistency loss to maintain the integrity of the original data during transformation, which is illustrated in Fig. 3. Specifically, \(\boldsymbol{P}_{\mathrm{res}}\) is input into the encoder network to obtain the encoding result \(\boldsymbol{z}_{\mathrm{res}}\). Next, we concatenate \(\boldsymbol{z}_{\mathrm{res}}\) and \(\boldsymbol{z}_{\mathrm{src}}\), feeding them into the global dense network to reconstruct the source parsing map, denoted as \(\boldsymbol{P}_{\mathrm{cyc}}\). Finally, we apply the pixel-wise reconstruction loss Eq. (1) to compute the loss between \(\boldsymbol{P}_{\mathrm{cyc}}\) and \(\boldsymbol{P}_{\mathrm{src}}\), referred to as \(\mathcal{L}_{\mathrm{cyc}}\).
Facial component loss. To further enhance the perceptually significant facial components, we introduce a facial component loss for the left eye, right eye, nose, and mouth. We first crop the regions of interest with ROI align [52] and then train a separate, small local discriminator for each region to distinguish whether the restored patches are real, pushing the patches close to real facial component shapes. The facial component loss is defined as follows:
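The displayed equation is not reproduced here; a plausible adversarial form over the cropped regions, with \(\hat{\boldsymbol{P}}_{ROI}\) (our notation) denoting the patch cropped from \(\boldsymbol{P}_{\mathrm{res}}\), is

\[ \mathcal{L}_{\mathrm{fp}} = \sum_{ROI} \mathbb{E}\left [ \log \left ( 1 - D_{ROI}\left ( \hat{\boldsymbol{P}}_{ROI} \right ) \right ) \right ], \]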
where \(ROI\) represents the region of the facial component (e.g., left eye, right eye, nose and mouth), and \(D_{ROI}\) is the local discriminator for each region.
The above loss functions focus only on the face parsing result, which constrains the output to be consistent in the semantic space; however, the reconstructed pixels may not be well aligned in the coordinate space. We therefore add a spatial coordinate loss when computing the cycle consistency. Specifically, we construct a coordinate map \(\boldsymbol{M}_{\mathrm{src}} \in \mathbb{R}^{2 \times H \times W}\), where \(\boldsymbol{M}_{\mathrm{src}}^{(i, j)}=(i, j)\) indicates the spatial coordinate. After obtaining the reconstructed \(\boldsymbol{P}_{\mathrm{cyc}}\), we convert it to a coordinate map \(\boldsymbol{M}_{\mathrm{cyc}}\). Because \(\boldsymbol{M}_{\mathrm{cyc}}\) has already been mapped by the global dense network, it may not be as well aligned as \(\boldsymbol{M}_{\mathrm{src}}\). Inspired by Ref. [49], we minimize the distance between the two maps by optimizing the following function:
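The displayed equation is omitted here; given the description, a plausible form is the \(\mathcal{L}_{1}\) distance between the two coordinate maps,

\[ \mathcal{L}_{\mathrm{s}} = \left \| \boldsymbol{M}_{\mathrm{cyc}} - \boldsymbol{M}_{\mathrm{src}} \right \|_{1}. \]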
This spatially-variant consistency loss in the coordinate space can constrain the per-pixel correspondence to be one-to-one and reversible, which reduces the artifacts inside each facial part.
Overall loss functions. The overall loss function for the proposed method is as follows:
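The displayed equation is omitted here; consistent with the weights listed below, it takes the form

\[ \mathcal{L} = \lambda _{\mathrm{r}} \mathcal{L}_{\mathrm{r}} + \lambda _{\mathrm{a}} \mathcal{L}_{\mathrm{a}} + \lambda _{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}} + \lambda _{\mathrm{fp}} \mathcal{L}_{\mathrm{fp}} + \lambda _{\mathrm{s}} \mathcal{L}_{\mathrm{s}}, \]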
where \(\lambda _{\mathrm{r}}\), \(\lambda _{\mathrm{a}}\), \(\lambda _{\mathrm{cyc}}\), \(\lambda _{\mathrm{fp}}\) and \(\lambda _{\mathrm{s}}\) are the weights balancing the corresponding loss terms.
3.6 Datasets
With the development of mobile devices, the quality and resolution of images have improved considerably, such that existing benchmark datasets, except CelebAMask-HQ [53], struggle to meet our requirements. However, CelebAMask-HQ suffers from the problem that only a small number of its images can serve as reference images, which restricts the generalizability of our method due to the obvious imbalance between source and reference images. To further train our method, we construct a large-scale face dataset that contains 14,000 high-resolution face images from different countries and regions in Asia. The collected dataset comprises in-the-wild images, and the age spans from 5 to 50 years old, covering the main age groups with a high demand for face editing. Meanwhile, the facial appearances range from ordinary people to celebrities, meeting the requirements of different degrees of shape transformation. These images vary in resolution and visual quality, ranging from 32 × 32 to 5760 × 3840. Some show crowds of several people, whereas others focus on the face of a single person. Thus, several image processing steps are necessary to ensure consistent quality and to center each image on the face region. To improve the overall image quality, we preprocess each JPG image using two pre-trained neural networks: a convolutional autoencoder trained to remove JPG artifacts in natural images and an adversarially trained super-resolution network. To handle cases where the facial region extends outside the image, we employ padding and filtering to extend the dimensions of the images. Then, we select an oriented crop rectangle based on the facial landmark annotations, transform it to 4096 × 4096 pixels using bilinear filtering, and scale it to 1024 × 1024 resolution using a box filter. We perform the above processing for all 32,569 images. We further analyze the resulting 1024 × 1024 images to estimate the final image quality, sort the images accordingly, and discard all but the best 14,000 images.
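The final crop-and-rescale step could look roughly like the following sketch (Pillow ≥ 9.1); the quad variable stands for the four corners of the oriented crop rectangle derived from the landmarks and is an assumption about the interface.

```python
# Illustrative crop-and-rescale step: map the oriented crop rectangle to a large
# axis-aligned square with bilinear filtering, then downscale with a box filter.
from PIL import Image

def crop_and_rescale(img: Image.Image, quad, out_size=1024, transform_size=4096):
    # quad: [(x0, y0), (x1, y1), (x2, y2), (x3, y3)] corners of the oriented rectangle
    data = [coord for point in quad for coord in point]
    img = img.transform((transform_size, transform_size), Image.Transform.QUAD,
                        data, Image.Resampling.BILINEAR)
    return img.resize((out_size, out_size), Image.Resampling.BOX)
```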
4 Experiments
4.1 Experiment details
We implement our method in PyTorch and train the model on a single Nvidia RTX TITAN GPU. For the encoder, we use five basic dense blocks [54] to extract local features and then utilize a flatten layer to obtain a 128-dimensional vector for each face component. The decoder consists of five symmetric dense blocks and dense transition layers with transposed convolution layers for upsampling. During training, we use the Adam optimizer [55] and set the batch size to 32. We set \(\lambda _{\mathrm{r}} = 200\), \(\lambda _{\mathrm{a}} = 1\), \(\lambda _{\mathrm{cyc}} = 1\) and \(\lambda _{\mathrm{s}} = 200\). Similar to CycleGAN [9], we set the initial learning rate to 0.0001, fix it for the first 40 epochs, and linearly decrease it for another 40 epochs.
The evaluation dataset CelebAMask-HQ contains 30,000 aligned facial images of size 1024 × 1024 and corresponding semantic segmentation labels of size 512 × 512. Each label in the dataset comprises 19 classes (e.g., skin, eyes, eyebrows, and hair). In our experiments, five components are considered: the eyes (left eye and right eye), nose, mouth (upper lip, mouth, and lower lip), and skin. We obtain a rough version of the face components from the semantic segmentation labels by an image dilation operation, which yields the mask images. We take 2000 images as the test set for performance evaluation, and all images are resized to 256 × 256. The input sizes of the five face components are determined by the maximum size of each component. Specifically, we use 64 × 32, 64 × 32, 128 × 64, 64 × 128, and 256 × 256 for the left eye, right eye, nose, mouth, and skin, respectively, in our local embedding network. On a single Nvidia RTX TITAN GPU, the runtime on a 1024 × 1024 source image is 0.4 s, including 0.1 s for face parsing of both the source and reference images and 0.3 s for the shape transformation.
4.2 Overall face shape transfer
Qualitative evaluation. To demonstrate the superiority of our method, we compare the quality of sample results with several benchmark methods, namely the parametric weight-change face reshaping method [22], MaskGAN [53], EditGAN [17], and DeepFaceEditing [19]. As demonstrated in Fig. 4, the 3D modeling method only reshapes the face contour region. This has two effects: first, it limits the reshaping diversity and global coordination of the results because there are no warping operations outside that region; second, the background is easily affected because half of the operation region lies on the background area. Moreover, the error between 3D modeling and 2D projection is unavoidable, and many tiny distortions can be found in the contour region of the face. MaskGAN fails to preserve the texture of the source image and performs poorly on large deformations. Meanwhile, EditGAN and DeepFaceEditing suffer from artifacts and blurring between the face and the background. Our method outperforms these methods in both realism and fidelity. Owing to the mask-guided operation, our method is not limited by large-scale shape transformations.
Quantitative evaluation. To demonstrate the effectiveness of the shape transformation, we calculate the cosine similarity and structural similarity index measure (SSIM) between the edited and reference masks, where a greater score represents higher similarity. We first select 600 pairs of faces from the testing set of CelebAMask-HQ, and each pair contains a source mask, a reference mask, and a modified mask, with operations performed on the whole face. The evaluation results are listed in Table 1, and the scores are averaged over all experiments. According to the cosine similarity and SSIM scores of the two evaluation categories, our method performs the shape transformation effectively, because the edited mask is more similar to the reference mask than the source mask is. In other words, our method successfully transfers the reference face shape to the source. However, due to differences in hairstyle and face orientation, the edited masks are not completely identical to the reference masks.
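A minimal sketch of how these two mask-similarity scores could be computed is shown below; treating the parsing maps as integer label images and one-hot encoding them before the cosine similarity are our assumptions about the exact protocol.

```python
# Illustrative computation of cosine similarity and SSIM between two parsing masks.
import numpy as np
from skimage.metrics import structural_similarity

def mask_similarity(mask_a, mask_b, n_classes=19):
    """mask_a, mask_b: (H, W) integer parsing maps with labels in [0, n_classes)."""
    ssim = structural_similarity(mask_a.astype(np.float64), mask_b.astype(np.float64),
                                 data_range=n_classes - 1)
    # one-hot encode so label indices are not treated as magnitudes
    a = np.eye(n_classes)[mask_a].reshape(-1)
    b = np.eye(n_classes)[mask_b].reshape(-1)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return cosine, ssim
```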
Personal identity preservation. When people reshape their faces, they want to beautify themselves while still being recognized as the same person, rather than becoming too similar to the reference face. Thus, our method must strike a balance between the two targets, which makes identity preservation ability crucial. To evaluate this ability, we conduct an additional face verification experiment via a person re-identification (Re-ID) method [56]. In the experimental setting, we first select 600 pairs of faces from the testing set of CelebAMask-HQ, and each pair contains an unmodified face and a modified face, where the different methods perform operations of the same magnitude on the whole face. The Re-ID accuracies are listed in Table 2, which shows that FST preserves the original identity better than the other state-of-the-art face manipulation methods. To further demonstrate that FST achieves an obvious shape transfer according to the reference image, we also report the Re-ID accuracy of the source images. As shown in Table 2, the source images achieve a very high accuracy, demonstrating that FST does not rely on small manipulations to obtain a higher Re-ID score. Additionally, we evaluate the inference speed and show that the encoder-decoder structure improves the re-identification scores without incurring additional time cost. FST thus balances the preservation of the source identity, the degree of face reshaping, and efficiency.
User study. To further evaluate the image quality of the above methods, we collect 2000 pairwise comparisons from a total of 40 participants under the same time and environment to conduct a user study. For each subject, we first show a reference image as the instruction to guide the users. During the study, we randomly choose two of the methods and present one result for each method. We then ask each subject to select the result that better reflects the target mask and can still be recognized as the same person, in terms of face component shape and global coordination. The results in Table 3 illustrate that FST performs favorably against the state-of-the-art methods, indicating that we achieve high-quality face shape transfer and high-fidelity identity preservation simultaneously.
4.3 Individual component shape transfer
Qualitative evaluation. To demonstrate the ability to adjust facial components, we choose the mouth region as the target and compare the component editing results of different state-of-the-art methods. First, we directly replace the mask of the mouth, upper lip, and lower lip of the source mask with those of the reference mask. Figure 5 shows the visual results of FST and the other state-of-the-art methods. With the wide-open reference mouth, MaskGAN [53], FENeRF [57] and DeepFaceEditing [19] all generate teeth between the upper and lower lips. However, their limited performance makes the results unrealistic due to obvious distortions and color errors. FENeRF [57] loses all background information, and distortions occupy much of the contour region. Considering that DeepFaceEditing [19] is an image generation method aided by sketches, it should produce high-quality results; however, its inability to recover mouth details makes it fail in this task. Compared with the above methods, our method uses parsing masks to control the shape and thus produces better results while preserving facial structures and personal identities.
Quantitative evaluation. To measure the generation quality of different models, we introduce the Fréchet inception distance (FID) [58] and sliced Wasserstein distance (SWD) [59] into the quantitative evaluation experiments. Table 4 reports the comparison results when reshaping the mouth region. MaskGAN [53] produces plausible results but sometimes cannot transfer the mouth shape accurately because it exchanges attributes from the source image in the latent space. FENeRF [57] obtains a good score but fails in the mouth region, because the performance of EditGAN [17] may be influenced by the size of the training data and the network design. DeepFaceEditing [19] has inferior reconstruction ability compared with the other methods, as the target image does not provide spatial information for learning a better mapping with the user-defined mask.
4.4 Ablation study
Loss functions. To qualitatively demonstrate the contribution of the loss functions, we randomly select 70 reference images to guide the source image in performing shape transformation on the whole face. We provide visual comparisons in Fig. 6 to verify the effectiveness of the designed loss functions. Here, we gradually add \(\mathcal{L}_{\mathrm{r}}\), \(\mathcal{L}_{\mathrm{cyc}}\) and \(\mathcal{L}_{\mathrm{fp}}\) to our training. Using only \(\mathcal{L}_{\mathrm{r}}\) leads to many more artifacts, e.g., around the eye and skin regions, because the method only learns the semantic difference and does not know the facial structure. After adding \(\mathcal{L}_{\mathrm{cyc}}\), the mapping functions are further constrained and produce fewer distortions than before. The transformation results appear better aligned and more visually pleasing when \(\mathcal{L}_{\mathrm{fp}}\) is added, because \(\mathcal{L}_{\mathrm{fp}}\) penalizes distortions and artifacts and enforces a smoother contour for all pixels in the same region.
Encoder networks. To demonstrate the benefit of the local embedding network, we randomly select 70 reference images to guide the source image in performing shape transformation on the nose. We provide visual comparisons in Fig. 7 to verify the effectiveness of the designed encoder network. When ResNet [60] is used for global encoding, artifacts cannot be avoided on either side of the nose, because it is difficult to accurately locate the features related to the nose in the global encoding vectors. Although the artifacts are small, their impact on the final results is obvious, since faces exhibit complex multidimensional visual patterns and abnormal parts are easily noticed. Therefore, the global encoding method cannot reshape individual face components without affecting other components. When ResNet is used for local encoding, the face orientation becomes the most crucial factor influencing the final results. As shown in Fig. 7, if the face is turned slightly, the direction of the bridge of the nose is distorted. Compared with the two methods mentioned above, our method learns the original structure of each component, which helps preserve the facial structure.
Shape transformation networks. To demonstrate the benefit of the spatial transformer network, we randomly select 70 reference images to guide the source image in performing shape transformation on the whole face. We provide visual comparisons in Fig. 8 to verify the effectiveness of the designed shape transformation network. When affine or perspective transformations are used to reshape the face, distortions and artifacts cannot be avoided, and pixels in the same region do not form a smooth contour, especially around the nose. Moreover, it is easy to see that our spatial transformer network achieves denser and more accurate shape transformation operations on the source image than the other shape transformation networks.
5 Conclusion and discussion
In this work, we propose FST, a novel face shape transfer framework that obtains high-quality face reshaping results and overcomes the limitations of existing methods regarding the degree of shape transformation and the need for a precise intermediate representation. Through separate face component encoding networks, FST extracts the original structure of each component, which preserves the facial structure and personal identity of the source image. Meanwhile, a novel spatial transformer network with a coordinate-based reconstruction loss and a region-based facial component loss is introduced to transmit and fuse the features of the source and reference images, further boosting the performance of face reshaping. In addition, we show that the learned embedding space of the semantic parsing map allows us to directly manipulate the parsing map and generate shape changes according to the preference of the user. Extensive experiments demonstrate that our framework achieves state-of-the-art face-reshaping results with observable geometric changes.
Since our method struggles with shapeless attributes (e.g., hairstyle, skin color), it fails on complex editing tasks such as changing the complexion or hairstyle. Additionally, due to the lack of corresponding data, obvious facial occlusion and rotation also pose great challenges to our method. To mitigate these shortcomings, we will focus on attribute disentanglement and on eliminating dataset bias, leading to more robust and accurate face editing.
Data availability
The datasets collected during the current study are available from the corresponding author upon reasonable request.
Abbreviations
FST: face shape transfer
GAN: generative adversarial network
SSIM: structural similarity index measure
References
Suryanarayana, G. K., & Dubey, R. (2017). Image analyses of supersonic air-intake buzz and control by natural ventilation. Journal of Visualization, 20(4), 711–727.
Liu, L., Yu, H., Wang, S., Wan, L., & Han, S. (2021). Learning shape and texture progression for young child face aging. Signal Processing. Image Communication, 93, 116127.
Fan, X., Chai, Z., Feng, Y., Wang, Y., Wang, S., & Luo, Z. (2016). An efficient mesh-based face beautifier on mobile devices. Neurocomputing, 172, 134–142.
Zhang, J., Shan, S., Kan, M., & Chen, X. (2014). Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In D. J. Fleet, T. Pajdla, B. Schiele, et al. (Eds.), Proceedings of the 13th European conference on computer vision (pp. 1–16). Cham: Springer.
Alvarez, F. J. A., Parra, E. B. B., & Tubio, F. M. (2017). Improving graphic expression training with 3D models. Journal of Visualization, 20(4), 889–904.
Vlasic, D., Brand, M., Pfister, H., & Popovic, J. (2006). Face transfer with multilinear models. In J. W. Finnegan & D. Shreiner (Eds.), Proceedings of the international conference on computer graphics and interactive techniques (pp. 1–8). New York: ACM.
Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., & Nießner, M. (2016). Face2face: real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2387–2395). Piscataway: IEEE.
Neumann, T., Varanasi, K., Wenger, S., Wacker, M., Magnor, M., & Theobalt, C. (2013). Sparse localized deformation components. ACM Transactions on Graphics, 32(6), 1–10.
Chang, H., Lu, J., Yu, F., & Finkelstein, A. (2018). Pairedcyclegan: asymmetric style transfer for applying and removing makeup. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 40–48). Piscataway: IEEE.
Yang, H., Huang, D., Wang, Y., & Jain, A. K. (2018). Learning face age progression: a pyramid architecture of GANs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 31–39). Piscataway: IEEE.
Leyvand, T., Cohen-Or, D., Dror, G., & Lischinski, D. (2008). Data-driven enhancement of facial attractiveness. ACM Transactions on Graphics, 27, 1–9.
Garrido, P., Valgaerts, L., Rehmsen, O., Thormahlen, T., Perez, P., & Theobalt, C. (2014). Automatic face reenactment. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4217–4224). Piscataway: IEEE.
Bitouk, D., Kumar, N., Dhillon, S., Belhumeur, P., & Nayar, S. K. (2008). Face swapping: automatically replacing faces in photographs. ACM Transactions on Graphics, 27, 1–8.
Gu, S., Bao, J., Yang, H., Chen, D., Wen, F., & Yuan, L. (2019). Mask-guided portrait editing with conditional GANs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3436–3445). Piscataway: IEEE.
Choi, Y., Uh, Y., Yoo, J., & Ha, J.-W. (2020). Stargan v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8188–8197). Piscataway: IEEE.
Chen, Z., Wang, C., Yuan, B., & Tao, D. (2020). PuppeteerGAN: arbitrary portrait animation with semantic-aware appearance transformation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13518–13527). Piscataway: IEEE.
Ling, H., Kreis, K., Li, D., Kim, S. W., Torralba, A., & Fidler, S. (2021). EditGAN: high-precision semantic image editing. arXiv preprint. arXiv:2111.03186.
Abdal, R., Zhu, P., Mitra, N. J., & Wonka, P. (2021). Styleflow: attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics, 40(3), 1–21.
Chen, S.-Y., Liu, F.-L., Lai, Y.-K., Rosin, P. L., Li, C., Fu, H., et al. (2021). DeepFaceEditing: deep face generation and editing with disentangled geometry and appearance control. arXiv preprint. arXiv:2105.08935.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Liao, J., Lima, R. S., Nehab, D., Hoppe, H., Sander, P. V., & Yu, J. (2014). Automating image morphing using structural similarity on a halfway domain. ACM Transactions on Graphics, 33(5), 1–12.
Zhao, H., Jin, X., Huang, X., Chai, M., & Zhou, K. (2018). Parametric reshaping of portrait images for weight-change. IEEE Computer Graphics and Applications, 38(1), 77–90.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on computer graphics and interactive techniques (pp. 187–194). Piscataway: IEEE.
Tena, J. R., De la Torre, F., & Matthews, I. (2011). Interactive region-based linear 3D face models. ACM Transactions on Graphics, 30, 1–10.
Maleš, L., Marčetić, D., & Ribarić, S. (2019). A multi-agent dynamic system for robust multi-face tracking. Expert Systems with Applications, 126, 246–264.
Kemelmacher-Shlizerman, I., Shechtman, E., Garg, R., & Seitz, S. M. (2011). Exploring photobios. ACM Transactions on Graphics, 30(4), 1–10.
Hassner, T. (2013). Viewing real-world faces in 3d. In Proceedings of the IEEE international conference on computer vision (pp. 3607–3614). Piscataway: IEEE.
Collins, E., Bala, R., Price, B., & Susstrunk, S. (2020). Editing in style: uncovering the local semantics of gans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5771–5780). Piscataway: IEEE.
Alharbi, Y., & Wonka, P. (2020). Disentangled image generation through structured noise injection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5134–5142). Piscataway: IEEE.
Zhu, P., Abdal, R., Qin, Y., & Wonka, P. (2020). Sean: image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5104–5113). Piscataway: IEEE.
Zhan, F., Zhu, H., & Lu, S. (2019). Spatial fusion GAN for image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3653–3662). Piscataway: IEEE.
Shocher, A., Gandelsman, Y., Mosseri, I., Yarom, M., Irani, M., Freeman, W. T., et al. (2020). Semantic pyramid for image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7457–7466). Piscataway: IEEE.
Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020). Interpreting the latent space of GANs for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9243–9252). Piscataway: IEEE.
Voynov, A., & Babenko, A. (2020). Unsupervised discovery of interpretable directions in the GAN latent space. In International conference on machine learning (pp. 9786–9796). Stroudsburg: International Machine Learning Society.
Park, T., Zhu, J.-Y., Wang, O., Lu, J., Shechtman, E., Efros, A., et al. (2020). Swapping autoencoder for deep image manipulation. In Proceedings of the 34th international conference on neural information processing systems (Vol. 33, pp. 7198–7211). Red Hook: Curran Associates.
Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2337–2346). Piscataway: IEEE.
Chen, S.-Y., Su, W., Gao, L., Xia, S., & Fu, H. (2020). DeepFaceDrawing: deep generation of face images from sketches. ACM Transactions on Graphics, 39(4), 72.
Chen, A., Liu, R., Xie, L., Chen, Z., Su, H., & Yu, J. (2022). SofGAN: a portrait image generator with dynamic styling. ACM Transactions on Graphics, 41(1), 1–26.
Chu, W., Hung, W.-C., Tsai, Y.-H., Cai, D., & Yang, M.-H. (2019). Weakly-supervised caricature face parsing through domain adaptation. In Proceedings of the 2019 IEEE international conference on image processing (pp. 3282–3286). Piscataway: IEEE.
Li, Z., Zhang, S., Zhang, Z., Meng, Q., Liu, Q., & Zhou, H. (2023). Attention guided domain alignment for conditional face image generation. Computer Vision and Image Understanding, 234, 103740.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8798–8807). Piscataway: IEEE.
Comaniciu, D., & Meer, P. (2002). Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619.
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86), 2579–2605.
Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. In Proceedings of the 29th international conference on neural information processing systems (pp. 2017–2025). Red Hook: Curran Associates.
Lin, C.-H., Yumer, E., Wang, O., Shechtman, E., & Lucey, S. (2018). ST-GAN: spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9455–9464). Piscataway: IEEE.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., et al. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4681–4690). Piscataway: IEEE.
Johnson, J., Alahi, A., & Li, F.-F. (2016). Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the 14th European conference on computer vision (pp. 694–711). Cham: Springer.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Y. Bengio & Y. LeCun (Eds.), Proceedings of the 3rd international conference on learning representations (pp. 1–14). San Diego, USA.
Chu, W., Hung, W.-C., Tsai, Y.-H., Chang, Y.-T., Li, Y., Cai, D., et al. (2021). Learning to caricature via semantic shape transform. International Journal of Computer Vision, 129(9), 2663–2679.
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4401–4410). Piscataway: IEEE.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214–223). Stroudsburg: International Machine Learning Society.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969). Piscataway: IEEE.
Lee, C.-H., Liu, Z., Wu, L., & Luo, P. (2019). MaskGAN: towards diverse and interactive facial image manipulation. arXiv preprint. arXiv:1907.11922.
Huang, R., Zhang, S., Li, T., & He, R. (2017). Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE international conference on computer vision (pp. 2439–2448). Piscataway: IEEE.
Kingma, D. P., & Ba, J. (2015). Adam: a method for stochastic optimization. In Y. Bengio & Y. LeCun (Eds.), Proceedings of the 3rd international conference on learning representations (pp. 1–15). San Diego, USA.
Nousi, P., Papadopoulos, S., Tefas, A., & Pitas, I. (2020). Deep autoencoders for attribute preserving face de-identification. Signal Processing. Image Communication, 81, 115699.
Sun, J., Wang, X., Zhang, Y., Li, X., Zhang, Q., Liu, Y., et al. (2021). FENeRF: face editing in neural radiance fields. arXiv preprint. arXiv:2111.15490.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, U. von Luxburg, S. Bengio, et al. (Eds.), Proceedings of the 31st international conference on neural information processing systems (pp. 6626–6637). Red Hook: Curran Associates.
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the 6th international conference on learning representations (pp. 1–26). Retrieved June 30, 2024, from https://openreview.net/forum?id=Hk99zCeAb.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint. arXiv:1512.03385.
Author information
Contributions
All the authors contributed to the network design and paper review. ZL, QL, XL, and WY performed the data collection and analysis. SZ and JL commented on previous versions of the manuscript. All the authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Z., Lv, X., Yu, W. et al. Face shape transfer via semantic warping. Vis. Intell. 2, 26 (2024). https://doi.org/10.1007/s44267-024-00058-7