
CN111127378A - Image processing method, image processing device, computer equipment and storage medium - Google Patents

Image processing method, image processing device, computer equipment and storage medium

Info

Publication number
CN111127378A
CN111127378A
Authority
CN
China
Prior art keywords
image
target
feature
style
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911335708.1A
Other languages
Chinese (zh)
Inventor
朱圣晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201911335708.1A priority Critical patent/CN111127378A/en
Publication of CN111127378A publication Critical patent/CN111127378A/en
Priority to PCT/CN2020/138533 priority patent/WO2021129642A1/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20132 Image cropping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses an image processing method and device, computer equipment and a storage medium, and belongs to the field of image processing. The method comprises the following steps: acquiring an image to be processed and a style image, wherein the style of a target face part in the image to be processed is different from that of the target face part in the style image; extracting a first partial image from the image to be processed and extracting a second partial image from the style image, wherein the first partial image and the second partial image comprise images of the target face part; carrying out style migration on the first partial image according to the second partial image to obtain a target partial image, wherein the style of the target face part in the target partial image is the same as that of the target face part in the second partial image; and fusing the target partial image and the image to be processed to generate a target image. Compared with the related art, in which only preset beautification parameters can be adjusted, the scheme provided by the embodiment of the application applies the style of a face part in another image to the image to be processed, thereby increasing the variety of beautification effects and realizing customized beautification.

Description

Image processing method, image processing device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing, and in particular, to an image processing method and apparatus, a computer device, and a storage medium.
Background
Beautification is an image processing method for beautifying face images; common beautification modes include whitening, face thinning, eye enlargement, mouth reduction, and the like.
In the related art, when beautifying a face image, a user can adjust the beautification parameters through controls. For example, the whitening degree of the face can be adjusted through a whitening control, and the magnification of the eyes can be adjusted through an eye-enlargement control. However, the beautification function in the related art only supports simple adjustment of preset beautification parameters, so the available effects are limited.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, computer equipment and a storage medium.
The technical scheme is as follows:
in one aspect, an image processing method is provided, and the method includes:
acquiring an image to be processed and a style image, wherein the style of a target face part in the image to be processed is different from that of the target face part in the style image;
extracting a first partial image from the image to be processed and extracting a second partial image from the style image, wherein the first partial image and the second partial image comprise images of the target human face part;
performing style migration on the first local image according to the second local image to obtain a target local image, wherein the style of the target face part in the target local image is the same as that of the target face part in the second local image;
and fusing the target local image and the image to be processed to generate a target image.
In another aspect, there is provided an image processing apparatus, the apparatus including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring an image to be processed and a style image, and the style of a target face part in the image to be processed is different from that of the target face part in the style image;
an extraction module, configured to extract a first partial image from the image to be processed, and extract a second partial image from the style image, where the first partial image and the second partial image include an image of the target face portion;
the style migration module is used for carrying out style migration on the first local image according to the second local image to obtain a target local image, wherein the style of the target face part in the target local image is the same as that of the target face part in the second local image;
and the generating module is used for fusing the target local image and the image to be processed to generate a target image.
In another aspect, an embodiment of the present application provides a computer device, which includes a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the image processing method as described in the above aspect.
In another aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction for execution by a processor to implement the image processing method of the above aspect.
In another aspect, a computer program product is provided, which stores at least one instruction that is loaded and executed by a processor to implement the image processing method according to the above aspect.
In the embodiment of the application, after the image to be processed and the style image are obtained, a first local image and a second local image which comprise a target face part image are respectively extracted from the image to be processed and the style image, so that the first local image is subjected to style migration according to the second local image to obtain a target local image, and finally the target local image and the image to be processed are fused to generate the target image; compared with the prior art that only face beautifying parameter adjustment can be performed, the scheme provided by the embodiment of the application can be used for applying the style of the face part in other images to the image to be processed, so that the variety of face beautifying is improved, and customized face beautifying is realized.
Drawings
FIG. 1 illustrates a block diagram of a computer device provided in an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of an image processing method shown in an exemplary embodiment of the present application;
FIG. 3 is an interface diagram illustrating an implementation of an image processing method according to an exemplary embodiment;
FIG. 4 illustrates a flow chart of an image processing method shown in another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a grayscale matrix and a clipping matrix provided by an exemplary embodiment;
FIG. 6 is a schematic diagram of a style migration process using a style migration network;
FIG. 7 is a network architecture diagram of an encoded network provided by an exemplary embodiment;
FIG. 8 is a block diagram of a convolutional layer in a coding network provided by an exemplary embodiment;
FIG. 9 is a network architecture diagram of a decoding network provided by an exemplary embodiment;
FIG. 10 is a schematic diagram illustrating an implementation of a style migration process for an image to be processed, in accordance with an illustrative embodiment;
FIG. 11 is a flow diagram illustrating a style migration network training process in accordance with an illustrative embodiment;
fig. 12 is a block diagram showing a configuration of an image processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Reference herein to "a plurality" means two or more. "And/or" describes the association relationship of the associated objects, meaning that there may be three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Referring to fig. 1, a block diagram of a computer device 100 according to an exemplary embodiment of the present application is shown. The computer device 100 may be a smartphone, a tablet computer, a notebook computer, etc. The computer device 100 in the present application may include one or more of the following components: processor 110, memory 120, display 130.
Processor 110 may include one or more processing cores. The processor 110 connects various components throughout the computer device 100 using various interfaces and lines, and performs the various functions of the computer device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and invoking data stored in the memory 120. Optionally, the processor 110 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Neural-Network Processing Unit (NPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing the content to be displayed by the touch display screen 130; the NPU is used for realizing Artificial Intelligence (AI) functions; and the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 110 and instead be implemented by a separate chip.
In one possible implementation, in this embodiment, the steps related to the neural network may be performed by the NPU, the steps related to the image display may be performed by the GPU, and the steps related to the operation within the application may be performed by the CPU.
The Memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 120 includes a non-transitory computer-readable medium. The memory 120 may be used to store instructions, programs, code sets, or instruction sets. The memory 120 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing various method embodiments described below, and the like; the storage data area may store data (such as audio data, a phonebook) created according to the use of the computer device 100, and the like.
The display screen 130 is a display component for displaying a user interface. Optionally, the display screen 130 further has a touch function, and a user can perform a touch operation on the display screen 130 by using any suitable object such as a finger, a touch pen, and the like through the touch function.
The display screen 130 is typically provided on the front panel of the computer device 100. The display screen 130 may be designed as a full screen, a curved screen, a specially-shaped screen, a double-sided screen, or a folding screen. The display screen 130 may also be designed as a combination of a full screen and a curved screen, or a combination of a specially-shaped screen and a curved screen, which is not limited in this embodiment.
In one possible implementation, the computer device 100 further includes a camera assembly for capturing RGB images (such as an RGB camera), which may be a front camera or a rear camera of the computer device 100.
Optionally, in this embodiment of the application, when the computer device 100 is used to shoot, the camera assembly is in an open state, and image acquisition is performed, and when a trigger operation on the shutter control is received, the computer device 100 generates an image to be processed according to an image currently acquired by the camera assembly.
In addition, those skilled in the art will appreciate that the configuration of the computer device 100 illustrated in the above figure does not constitute a limitation of the computer device 100; the terminal may include more or fewer components than those illustrated, may combine certain components, or may use a different arrangement of components. For example, the computer device 100 may further include a microphone, a speaker, a radio frequency circuit, an input unit, a sensor, an audio circuit, a Wireless Fidelity (WiFi) module, a power supply, a Bluetooth module, and other components, which are not described herein again.
Referring to fig. 2, a flowchart of an image processing method according to an exemplary embodiment of the present application is shown. The present embodiment is illustrated with the method applied to the computer apparatus shown in fig. 1. The method comprises the following steps:
step 201, acquiring an image to be processed and a style image, wherein the style of a target face part in the image to be processed is different from that of the target face part in the style image.
The image to be processed and the style image both comprise images of a target face part, the image to be processed is an image needing style migration, the style image is a reference image when the style of the image to be processed is migrated, namely the style of the target face part in the style image is referenced when the style of the target face part in the image to be processed is migrated.
In one possible embodiment, the image to be processed and the style image belong to different pictures; for example, the image to be processed is a self-portrait picture, and the style image is a celebrity picture downloaded from the network. Alternatively, the image to be processed and the style image belong to the same picture; for example, the image to be processed is the image of person A in the picture, and the style image is the image of person B in the same picture.
Optionally, the target face part includes at least one of the following: eyes, mouth, ears, eyebrows, and nose.
Step 202, extracting a first partial image from the image to be processed, and extracting a second partial image from the style image, wherein the first partial image and the second partial image comprise images of the target face part.
In one possible implementation, when a style migration instruction is received and the style migration instruction indicates that style migration is to be performed on a target face part in the image to be processed, the computer device extracts a first partial image and a second partial image containing the target face part from the image to be processed and the style image, respectively. The extraction areas of the first partial image and the second partial image are either specified manually or divided automatically by the computer device.
Since the first partial image (or the second partial image) only occupies a part of the image to be processed (or the style image), the subsequent style migration requires less computing resources and is faster.
In one illustrative example, when a style transition (e.g., cosmetic-pupil processing) is required for an eye of a person in the image to be processed, the computer device extracts a first partial image and a second partial image containing the eye from the image to be processed and the style image, respectively.
And step 203, performing style migration on the first partial image according to the second partial image to obtain a target partial image, wherein the style of the target face part in the target partial image is the same as that of the target face part in the second partial image.
Further, for the extracted first partial image and second partial image, the computer device adjusts the style of the target face part in the first partial image according to the style of the target face part in the second partial image, so that the style of the first partial image is more similar to that of the second partial image, and finally generates the target partial image.
The size of the target local image is the same as that of the first local image, the image content of the target local image refers to the first local image, and the image style of the target local image refers to the second local image, namely the target local image is fused with the content feature of the first local image and the style feature of the second local image.
And 204, fusing the target local image and the image to be processed to generate a target image.
After the target local image is obtained, the computer equipment further fuses the target local image and the image to be processed, so that the target image is generated. In one possible embodiment, the computer device replaces the first partial image in the image to be processed with the target partial image, resulting in the target image.
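As a concrete illustration of this replacement, the following minimal NumPy sketch pastes a style-migrated patch back into the image to be processed, assuming the upper-left corner of the rectangle the first partial image was cut from is known; the function and variable names are hypothetical, not taken from the embodiment.

```python
import numpy as np

def fuse_target_image(image, target_local, top, left):
    """Replace the first-partial-image region of `image` with the target partial image.

    `image` is an H x W x 3 array, `target_local` an h x w x 3 patch, and
    (top, left) the upper-left corner of the rectangle the patch was extracted from.
    """
    fused = image.copy()
    h, w = target_local.shape[:2]
    fused[top:top + h, left:left + w] = target_local  # overwrite the original region
    return fused
```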
Compared with the related art, only the preset beauty parameters can be adjusted, in the embodiment of the application, the computer equipment can perform style migration on the corresponding face part in the image to be processed according to the style of the designated face part in the style image, the reference object of the style migration is not limited by the preset beauty parameters (can be an image containing the face part at will), and customized beauty can be realized; moreover, by extracting the local images and performing style migration on the local images, the whole face area does not need to be processed, so that the computing resources required by the style migration are reduced, the style migration efficiency is improved, and the image processing method can be applied to mobile terminals with weak computing power.
In an exemplary application scenario, the image processing method provided by the embodiment of the application may be applied to a beauty application program. As shown in fig. 3, in the running process of the beauty application, when it is necessary to refer to other face images to beautify a designated face portion in a designated image, a user selects a first photo 31 (i.e., an image to be processed) that needs to be beautified and a second photo 32 (i.e., a style image) with a desired beauty effect, and then clicks a photo selection determination key 33. The beauty application further displays a beauty part selection control on the user interface for the user to select the face part needing beauty (namely style migration). When the user selects the beauty part as the left eye and the right eye and clicks the beauty part determination button 34, the beauty application program performs the local image extraction, the style migration, and the image fusion processing by the above method, and finally generates and displays the target photograph 35.
To sum up, in the embodiment of the present application, after an image to be processed and a style image are acquired, a first partial image and a second partial image including a target face part image are respectively extracted from the image to be processed and the style image, so that the first partial image is subjected to style migration according to the second partial image to obtain a target partial image, and finally the target partial image and the image to be processed are fused to generate a target image; compared with the prior art that only face beautifying parameter adjustment can be performed, the scheme provided by the embodiment of the application can be used for applying the style of the face part in other images to the image to be processed, so that the variety of face beautifying is improved, and customized face beautifying is realized.
In a possible implementation mode, a five sense organ segmentation network and a style migration network are arranged in the computer device, and in the image processing process, the computer device extracts a local image containing a specified face part from the image through the five sense organ segmentation network and performs style migration on the extracted local image through the style migration network. The following description will be made by using exemplary embodiments.
Referring to fig. 4, a flowchart of an image processing method according to another exemplary embodiment of the present application is shown. The present embodiment is illustrated with the method applied to the computer apparatus shown in fig. 1. The method comprises the following steps:
step 401, acquiring an image to be processed and a style image, wherein the style of a target face part in the image to be processed is different from that of the target face part in the style image.
The step 201 may be referred to in the implementation manner of this step, and this embodiment is not described herein again.
For the image to be processed, the computer device performs partial image extraction through the following steps 402 to 404, and for the style image, the computer device performs partial image extraction through the following steps 405 to 407.
Step 402, inputting the image to be processed into the facial features segmentation network to obtain a first gray matrix output by the facial features segmentation network, wherein the first gray matrix is used for representing the position of the target face part in the image to be processed.
In one possible embodiment, the computer device first adjusts the image to be processed to a specified size (corresponding to the image input size of the segmentation network for five sense organs, such as 256 × 256px), thereby inputting the image to be processed of the specified size into the segmentation network for five sense organs.
Optionally, the computer device reads the Red-Green-Blue (RGB) values of each pixel point in the image to be processed, generates a to-be-processed image matrix I_{h,w,c} according to the read RGB values, and inputs I_{h,w,c} into the facial feature segmentation network to obtain a first gray matrix L_{h,w} output by the network. Here h and w are the height and width of the image to be processed, and c is the number of channels of the image to be processed (3 for an RGB image).
Optionally, the elements of L_{h,w} take the values 0 and 1, where the positions of the matrix elements equal to 1 are the positions of the target face part in the image to be processed. Schematically, the first gray matrix is shown in fig. 5.
Optionally, the facial feature segmentation Network in the embodiment of the present application may be a Convolutional Neural Network (CNN), a full Convolutional neural Network (FCN), or another neural Network for performing image segmentation, which is not limited in the embodiment of the present application.
And step 403, cutting the first gray matrix to obtain a first cutting matrix, where the first cutting matrix is used to represent a minimum rectangular region containing the target face part.
After the first gray matrix is obtained through the above steps, note that its size is the same as that of the image to be processed and it contains a large number of invalid elements (i.e., matrix elements corresponding to non-target face parts); using it directly for partial image extraction would waste computing resources. Therefore, after obtaining the first gray matrix, the computer device further crops it to obtain the minimum matrix containing the target face part.
In one possible implementation, the computer device crops the first gray matrix L_{h,w} according to the left, right, upper, and lower boundaries of the region of matrix elements equal to 1, obtaining a first clipping matrix L'_{h,w}; the first clipping matrix corresponds to the minimum rectangular area containing the target face part.
Schematically, as shown in fig. 5, the computer device crops a 7 × 3 first crop matrix 51 from the first gray matrix.
In step 404, a first partial image is generated according to the first cropping matrix and the image to be processed.
In a possible implementation, the computer device generates the first partial image from the to-be-processed image matrix I_{h,w,c} and the first clipping matrix L'_{h,w}: the first partial image is I' = I_{h,w,c} ⊙ L'_{h,w}, where ⊙ denotes multiplication of the elements at corresponding positions in the matrices (the clipping matrix acting on the corresponding rectangular region of the image).
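Steps 402 to 404 can be sketched as follows, assuming the segmentation network returns a binary 0/1 mask of the same size as the image; this is an illustrative NumPy sketch rather than the patented implementation, and the helper name is hypothetical.

```python
import numpy as np

def extract_partial_image(image, gray_matrix):
    """Crop the minimum rectangle containing mask == 1 and keep only the target face part.

    `image` is H x W x 3, `gray_matrix` is the H x W binary mask output by the
    facial feature segmentation network (1 marks the target face part).
    """
    rows = np.any(gray_matrix == 1, axis=1)
    cols = np.any(gray_matrix == 1, axis=0)
    top, bottom = np.where(rows)[0][[0, -1]]    # upper / lower boundaries of the 1-region
    left, right = np.where(cols)[0][[0, -1]]    # left / right boundaries of the 1-region

    crop_matrix = gray_matrix[top:bottom + 1, left:right + 1]   # first clipping matrix
    crop_region = image[top:bottom + 1, left:right + 1]         # same rectangle of the image
    partial = crop_region * crop_matrix[..., None]              # element-wise multiplication
    return partial, (top, left)
```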
And 405, inputting the style image into the facial feature segmentation network to obtain a second gray matrix output by the facial feature segmentation network, wherein the second gray matrix is used for representing the position of the target face part in the style image.
And 406, cutting the second gray matrix to obtain a second cutting matrix, wherein the second cutting matrix is used for representing a minimum rectangular area containing the target face part.
Step 407, a second partial image is generated according to the second clipping matrix and the style image.
Similar to the implementation process of the above steps 402 to 404, the computer device extracts the second local image from the style image by using the five sense organs segmentation network, which is not described herein again.
It should be noted that there is no strict sequence between the steps 402 to 404 and the steps 405 to 407, that is, the steps 402 to 404 and the steps 405 to 407 may be executed synchronously, which is not limited in this embodiment of the application.
After the above local image extraction is completed, the computer device inputs the local images into the style migration network, which converts the style of the first local image into the style of the second local image.
Optionally, the style migration network in the embodiment of the present application adopts an "encoding + decoding" network structure. As shown in fig. 6, the extracted first partial image 61 and second partial image 62 are first input into an encoding network 63 in the style migration network, and the encoding network 63 performs feature extraction on the first partial image 61 and the second partial image 62 respectively to obtain a first image feature 64 and a second image feature 65 (i.e., the content feature of the first partial image 61 and the style feature of the second partial image 62). Further, the computer device performs feature fusion on the first image feature 64 and the second image feature 65 to obtain a target image feature 66, inputs the target image feature 66 into a decoding network 67, and the decoding network 67 performs image restoration according to the target image feature 66 to obtain a target local image 68.
Step 408, inputting the first local image into a coding network to obtain a first image feature output by the coding network, where the coding network is used to perform feature extraction on the input image.
In a possible implementation manner, in order to reduce the amount of calculation in the image style migration process so that the style migration can be performed on a mobile terminal, the coding network in the embodiment of the present application is based on SqueezeNet. As a lightweight convolutional neural network, SqueezeNet can significantly reduce the number of network parameters while maintaining the accuracy of feature extraction, and is therefore suitable for mobile terminals with limited computing performance.
The core structure of SqueezeNet is the Fire module; a complete SqueezeNet contains 8 Fire modules, i.e. 8 Fire layers, where the earlier Fire layers extract low-level image features and the later Fire layers extract high-level image features, the high-level features being more abstract than the low-level ones.
If the complete SqueezeNet were used directly as the coding network, the image features it finally outputs would be highly abstract; restoring an image from such highly abstract features is difficult, and the quality of the restored image is poor.
Therefore, in order to reduce the difficulty of subsequent image restoration and improve the restoration quality, in one possible implementation the coding network includes a first convolution layer, a maximum pooling layer, and n Fire layers, where n is a positive integer less than or equal to 4. Because the number of Fire layers is small, the output image features are prevented from becoming too abstract, which benefits the subsequent image restoration.
In one illustrative example, the network structure of the coding network is shown in FIG. 7. From top to bottom, the coding network comprises a first convolution layer 71, a first max pooling layer (maxpool) 72, a first Fire layer 73, a second Fire layer 74, a third Fire layer 75, a second max pooling layer 76 and a fourth Fire layer 77. The first convolution layer 71 is used to perform feature extraction on the input partial image, and the step size of the first max pooling layer 72 and the second max pooling layer 76 is 2.
Schematically, the structure of the first Fire layer 73, the second Fire layer 74, the third Fire layer 75 and the fourth Fire layer 77 is shown in fig. 7. Each Fire layer comprises a first extrusion (squeeze) module 781, a first expansion (expand) module 782 and a second expansion module 783, where the first extrusion module 781 is used for feature dimension reduction, the first expansion module 782 and the second expansion module 783 are used to raise the dimension of the features output by the first extrusion module 781, and the features output by the first expansion module 782 and the second expansion module 783 are further concatenated and output by a merging module 784. The first extrusion module 781 adopts a 1 × 1 convolution kernel, the first expansion module 782 adopts a 1 × 1 convolution kernel, and the second expansion module 783 adopts a 3 × 3 convolution kernel.
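A minimal sketch of a Fire layer with this structure, written with TensorFlow Keras layers, is shown below; the channel counts are left as parameters and no concrete hyperparameters of the patented network are implied.

```python
import tensorflow as tf

def fire_layer(x, squeeze_channels, expand_channels):
    """Fire layer: 1x1 squeeze, then parallel 1x1 and 3x3 expand branches, concatenated."""
    s = tf.keras.layers.Conv2D(squeeze_channels, 1, activation='relu')(x)    # first extrusion module
    e1 = tf.keras.layers.Conv2D(expand_channels, 1, activation='relu')(s)    # first expansion module (1x1)
    e3 = tf.keras.layers.Conv2D(expand_channels, 3, padding='same',
                                activation='relu')(s)                        # second expansion module (3x3)
    return tf.keras.layers.Concatenate()([e1, e3])                           # merging module
```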
In one possible embodiment, in order to eliminate artificial false edges in the partial image, the computer device performs mirror padding (Mirror Padding) on the first partial image before performing feature extraction on it through the coding network. During mirror padding, the padding scheme is [[0,0], [padding, padding], [padding, padding], [0,0]], where padding is half of the convolution kernel size rounded down; the mirror padding may be implemented with tf.
Optionally, in order to further reduce the amount of computation, the convolutions in the coding network use depthwise separable convolution, which decomposes a standard convolution into a depthwise convolution followed by a 1 × 1 pointwise convolution and thereby reduces both the network parameters and the amount of computation.
In an illustrative example, as shown in fig. 8, after a partial image is input into the first convolution layer 71 of the coding network, the first convolution layer 71 first performs mirror padding on the partial image, then convolves the padded partial image sequentially through the depthwise convolution and the pointwise convolution, and activates the convolution result with a nonlinear activation function, thereby outputting the image feature.
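A sketch of such a first convolution layer, assuming a TensorFlow implementation: tf.pad with REFLECT mode stands in for the mirror padding (the original only mentions "tf.", so the specific padding call is an assumption), and the filter count is illustrative.

```python
import tensorflow as tf

def first_conv_layer(x, filters=64, kernel_size=3):
    """Mirror padding, then depthwise convolution, pointwise convolution, and activation."""
    pad = kernel_size // 2  # half the convolution kernel size, rounded down
    x = tf.pad(x, [[0, 0], [pad, pad], [pad, pad], [0, 0]], mode='REFLECT')  # mirror padding
    x = tf.keras.layers.DepthwiseConv2D(kernel_size, padding='valid')(x)     # depthwise convolution
    x = tf.keras.layers.Conv2D(filters, 1)(x)                                # 1x1 pointwise convolution
    return tf.nn.relu(x)                                                     # nonlinear activation
```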
And step 409, inputting the second local image into the coding network to obtain a second image characteristic output by the coding network.
Similar to the feature extraction process of the first local image, the computer device inputs the second local image into the coding network, and the coding network performs feature extraction on the second local image to output a feature of the second image, which is not described herein again.
It should be noted that there is no strict time sequence between step 408 and step 409, that is, step 408 and step 409 may be executed synchronously, which is not limited in this embodiment of the present application.
And step 410, performing feature fusion on the first image feature and the second image feature to obtain a target image feature.
After the content feature (i.e. the first image feature) and the style feature (i.e. the second image feature) are respectively extracted through the coding network, the computer device further performs feature fusion on the content feature and the style feature to obtain a target image feature, wherein the target image feature is a feature obtained by fusing the content feature of the first partial image and the style feature of the second partial image.
In one possible embodiment, the feature fusion in this step includes the following sub-steps.
Firstly, a first mean value vector and a first standard deviation vector are constructed according to a first feature mean value and a first feature standard deviation corresponding to each feature channel in first image features.
In one possible embodiment, the first image feature has a size of a × a × b (i.e. it contains b channels). For each a × a feature map, the computer device calculates the mean and the standard deviation of the a × a feature values in that feature map, obtaining b means and b standard deviations, and then constructs a first mean vector cmean from the b means and a first standard deviation vector cstd from the b standard deviations.
In one illustrative example, the first image feature has a size of 27 × 27 × 256, and the computer device constructs a 256-dimensional first mean vector cmean and a 256-dimensional first standard deviation vector cstd.
And secondly, constructing a second mean value vector and a second standard deviation vector according to a second feature mean value and a second feature standard deviation corresponding to each feature channel in the second image feature.
Similar to the first step, the computer device calculates a second mean vector smean and a second standard deviation vector sstd.
And thirdly, generating target image characteristics according to the first image characteristics, the first mean vector, the first standard deviation vector, the second mean vector and the second standard deviation vector.
Further, in order to enable the fused target image feature to still embody the content feature of the first local image, on the basis of the first image feature, the computer device further generates the target image feature according to the first mean vector, the first standard deviation vector, the second mean vector and the second standard deviation vector which are obtained through construction.
In one possible implementation, the target image feature t may be generated using the following formula:
t = sstd × (c − cmean) / cstd + smean
wherein c is the first image feature.
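The three sub-steps above can be sketched as follows for NHWC feature tensors; this is an illustrative TensorFlow sketch of the formula, and the epsilon term is an added numerical-stability assumption rather than part of the embodiment.

```python
import tensorflow as tf

def fuse_features(content_feat, style_feat, eps=1e-5):
    """Align per-channel mean/std of the first image feature to those of the second image feature."""
    c_mean, c_var = tf.nn.moments(content_feat, axes=[1, 2], keepdims=True)  # first mean / variance per channel
    s_mean, s_var = tf.nn.moments(style_feat, axes=[1, 2], keepdims=True)    # second mean / variance per channel
    c_std = tf.sqrt(c_var + eps)   # first standard deviation vector
    s_std = tf.sqrt(s_var + eps)   # second standard deviation vector
    # t = sstd * (c - cmean) / cstd + smean
    return s_std * (content_feat - c_mean) / c_std + s_mean
```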
Step 411, inputting the target image characteristics into a decoding network to obtain a target local image output by the decoding network, where the decoding network is used to perform image restoration according to the input image characteristics.
After the content features and the style features are fused through the steps, the computer device needs to further utilize a decoding network to perform image restoration on the fused target image features, so that a target local image with both the content features of the first local image and the style features of the second local image is obtained.
Optionally, the decoding network and the encoding network are mirror structures of each other, and the decoding network includes n Fire transpose (FireTranspose) layers, an upsampling (upsample) layer, and a second convolution layer.
In an illustrative example, a network structure of a decoding network corresponding to the encoding network shown in fig. 7 is shown in fig. 9. From top to bottom, the decoding network comprises a first Fire transpose layer 91, a first upsample layer 92, a second Fire transpose layer 93, a third Fire transpose layer 94, a fourth Fire transpose layer 95, a second upsample layer 96, and a second convolutional layer 97.
Schematically, the structure of the first Fire transpose layer 91, the second Fire transpose layer 93, the third Fire transpose layer 94, and the fourth Fire transpose layer 95 is shown in fig. 9. Each Fire transpose layer includes a separation module 981, a second extrusion module 982, a third extrusion module 983, and a third expansion module 984, the second extrusion module 982 and the third extrusion module 983 being configured to perform feature dimension reduction, and the third expansion module 984 being configured to perform dimension enhancement on features output by the second extrusion module 982 and the third extrusion module 983. Wherein the second squeeze module 982 employs a 1 × 1 convolution kernel, the third squeeze module 983 employs a 3 × 3 convolution kernel, and the third expansion module 984 employs a 1 × 1 convolution kernel.
In one possible implementation, the processing performed by each Fire transpose layer is as follows:
firstly, the separation module divides the input feature into two parts by channel number, denoted x_l and x_r;
secondly, the second extrusion module convolves x_l with a 1 × 1 convolution kernel at stride 1 and outputs x_l_out, while the third extrusion module convolves x_r with a 3 × 3 convolution kernel at stride 1 and outputs x_r_out;
thirdly, x_l_out and x_r_out are summed and x_out is output;
and fourthly, the third expansion module convolves x_out with a 1 × 1 convolution kernel at stride 1 and outputs the result.
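A minimal TensorFlow sketch of these four sub-steps is given below; the channel parameters correspond to the kernel counts listed next, but no concrete values are hard-coded, and an even number of input channels is assumed for the split.

```python
import tensorflow as tf

def fire_transpose_layer(x, squeeze_channels, expand_channels):
    """Fire transpose layer: channel split, parallel 1x1 / 3x3 squeeze convolutions, sum, 1x1 expand."""
    x_l, x_r = tf.split(x, num_or_size_splits=2, axis=-1)                     # separation module
    x_l_out = tf.keras.layers.Conv2D(squeeze_channels, 1, strides=1)(x_l)     # second extrusion module (1x1)
    x_r_out = tf.keras.layers.Conv2D(squeeze_channels, 3, strides=1,
                                     padding='same')(x_r)                     # third extrusion module (3x3)
    x_out = x_l_out + x_r_out                                                 # sum the two branches
    return tf.keras.layers.Conv2D(expand_channels, 1, strides=1,
                                  activation='relu')(x_out)                   # third expansion module
```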
Optionally, in fig. 9, the number of depth convolution kernels in the extrusion module in the first Fire transpose layer 91 is 32, and the number of depth convolution kernels in the expansion module is 256;
the number of deep convolution kernels in the extrusion module in the second Fire transpose layer 93 is 32, and the number of deep convolution kernels in the expansion module is 128;
the number of deep convolution kernels in the extrusion module in the third Fire transpose layer 94 is 16, and the number of deep convolution kernels in the expansion module is 128;
the number of deep convolution kernels in the extrusion module in the fourth Fire transpose layer 95 is 16, and the number of deep convolution kernels in the expansion module is 64;
the size of the convolution kernels in the second convolution layer 97 is 3 × 3, the number of convolution kernels is 3, and the step size is 1.
Optionally, in order to reduce the amount of computation and improve the accuracy of feature expression, the upsampling mode adopted by the upsampling layers in the decoding network is sub-pixel convolution.
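Sub-pixel upsampling can be sketched with tf.nn.depth_to_space as below; the convolution placed before the pixel shuffle is a common pairing and an assumption here, not a detail taken from the patent.

```python
import tensorflow as tf

def subpixel_upsample(x, out_channels, scale=2):
    """Sub-pixel convolution: produce scale^2 * out_channels maps, then rearrange them spatially."""
    x = tf.keras.layers.Conv2D(out_channels * scale * scale, 3, padding='same')(x)
    return tf.nn.depth_to_space(x, block_size=scale)  # pixel shuffle: channels -> scale x larger feature map
```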
And step 412, fusing the target local image and the image to be processed to generate a target image.
The step 204 may be referred to in the implementation manner of this step, and this embodiment is not described herein again.
In this embodiment, the computer device uses the facial feature segmentation network to extract the local images, so no manual division of the face is required, which improves both the extraction efficiency and the extraction accuracy; by cropping the gray matrix and performing local image extraction with the cropped matrix, the amount of calculation during local image extraction can be reduced.
In addition, in this embodiment, the computer device performs feature extraction on the local images with an encoding network based on SqueezeNet and fuses the extracted content features and style features, so that the decoding network can perform image restoration on the fused image features and obtain the local image after style migration.
In an exemplary embodiment, a computer device performs a style migration process on an image to be processed as shown in FIG. 10.
Firstly, inputting the image to be processed 1001 and the style image 1002 respectively into a five sense organ segmentation network 1003 to obtain a first gray matrix 1004 and a second gray matrix 1005;
secondly, cutting the first gray matrix 1004 to obtain a first cutting matrix 1006; clipping the second gray scale matrix 1005 to obtain a second clipping matrix 1007;
thirdly, generating a first partial image 1008 according to the image 1001 to be processed and the first cropping matrix 1006; generating a second partial image 1009 according to the style image 1002 and the second clipping matrix 1007;
fourthly, inputting the first local image 1008 and the second local image 1009 into the coding network 1010 respectively to obtain a first image feature 1011 and a second image feature 1012;
fifthly, performing feature fusion on the first image features 1011 and the second image features 1012 to obtain target image features 1013;
sixthly, inputting the target image characteristics 1013 into a decoding network 1014 to obtain a target local image 1015;
and seventhly, generating a target image 1016 according to the image to be processed 1001 and the target local image 1015.
For the training process of the facial features segmentation network and the style migration network in the above embodiment, in a possible implementation manner, the facial features segmentation network and the style migration network are firstly trained separately, and are jointly trained after the separate training is completed.
Optionally, as shown in fig. 11, the training process of the style migration network may include the following steps.
Step 1101, acquiring a first sample image and a second sample image, wherein the first sample image and the second sample image contain images of the same face part, and the styles of the face part in the first sample image and the second sample image are different.
Optionally, the first sample image and the second sample image are partial images containing the same face region. For example, the first sample image and the second sample image are both partial images including the left eye.
In one possible embodiment, the first sample image and the second sample image may be extracted from the images by a trained facial segmentation network.
Step 1102, inputting the first sample image and the second sample image into a coding network to obtain a first sample feature and a second sample feature output by the coding network.
In the training process, the first sample image and the second sample image are respectively input into the coding network to be trained, and feature extraction is performed by the coding network to be trained to obtain a first sample feature and a second sample feature. The manner of extracting features with the coding network can refer to the above embodiments, and details are not repeated here.
And 1103, performing feature fusion on the first sample feature and the second sample feature to obtain a target sample feature.
The above embodiments can be referred to for the process of performing feature fusion on the sample features, and this embodiment is not described herein again.
And 1104, inputting the target sample characteristics into a decoding network to obtain a target sample image output by the decoding network, wherein the target sample image is obtained by performing style migration on the first sample image according to the style of the second sample image.
The above embodiments may be referred to in the process of performing image restoration on the target sample feature by using the decoding network to be trained, and this embodiment is not described herein again.
Step 1105, calculating a target loss from the target sample image, the first sample image, and the second sample image.
In order to measure the feature extraction and image restoration quality of the coding network and the decoding network, the computer device takes the first sample image and the second sample image as supervision, and calculates the target loss of the target sample image, so as to adjust the network parameters in the coding network and the decoding network based on the target loss.
Since the target sample image needs to have the content characteristics of the first sample image and the style characteristics of the second sample image, the target loss of the target sample image can be determined from both the content loss and the style loss. In one possible embodiment, this step may include the following steps.
First, a content loss is calculated from the target sample image and the first sample image, the content loss indicating a content difference between the target sample image and the first sample image.
Optionally, the computer device respectively inputs the target sample image and the first sample image into the coding network to obtain a feature map output by an output layer in the coding network, so as to calculate the content loss according to the feature maps corresponding to the target sample image and the first sample image.
In an illustrative example, when the network structure of the encoding network is as shown in fig. 7, the computer device acquires the feature maps (including the feature map of the target sample image and the feature map of the first sample image) output by the fourth Fire layer 77 (i.e., the output layer), and calculates the content loss from the feature maps of the two output images.
Optionally, the content loss is calculated as follows:
L_content = (1/2) · Σ_{i,j} (φ_l(P)_{i,j} − φ_l(X)_{i,j})²
where P is the target sample image, φ_l(P) is the image feature of the target sample image at the output layer l, X is the first sample image, φ_l(X) is the image feature of the first sample image at layer l, and i, j are the horizontal and vertical coordinates of points in the feature map.
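An illustrative TensorFlow sketch of this content loss, assuming the two output-layer feature maps have already been computed by the coding network:

```python
import tensorflow as tf

def content_loss(target_feat, first_sample_feat):
    """Half of the summed squared difference between output-layer features of the
    target sample image and the first sample image."""
    return 0.5 * tf.reduce_sum(tf.square(target_feat - first_sample_feat))
```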
And secondly, calculating style loss according to the target sample image and the second sample image, wherein the style loss is used for indicating style difference between the target sample image and the second sample image.
To improve the quality of training, the computer device needs to integrate the low-level image feature loss and the high-level image feature loss when calculating the style loss.
Optionally, the computer device inputs the target sample image and the second sample image into the coding network to obtain feature maps output by at least two feature extraction layers in the coding network; and calculating style loss according to the feature maps corresponding to the target sample image and the second sample image, wherein the at least two layers of feature extraction layers comprise an output layer and at least one intermediate layer of the coding network.
In an illustrative example, when the network structure of the coding network is as shown in fig. 7, the computer device obtains the feature maps output by the first convolution layer 71, the first Fire layer 73, the second Fire layer 74 and the fourth Fire layer 77 (each including the feature map of the target sample image and the feature map of the second sample image), calculates the style loss of each layer from the feature maps of the two images output by that layer, and then superimposes the per-layer style losses to obtain the total style loss.
In one possible implementation, the computer device first constructs a feature perception matrix and then calculates a style loss using the feature perception matrix.
Optionally, the feature perception matrix constructed by the computer device is as follows:
G^φ_j(x)_{c,c'} = (1 / (h_j · w_j · c_j)) · Σ_{h,w} φ_j(x)_{h,w,c} · φ_j(x)_{h,w,c'}
where h, w and c are the height, width and channels of the feature map respectively, c' is the transposed channel index of c, φ denotes feature extraction by the coding network, and j is the layer that outputs the feature map (such as the first convolution layer 71, the first Fire layer 73, the second Fire layer 74 and the fourth Fire layer 77 in the above example).
Correspondingly, the style loss is obtained by calculating, according to the feature perception matrix, the distance between the style features of the target sample image and those of the second sample image:
L_style = Σ_j ||G^φ_j(ŷ) − G^φ_j(y)||²_F
where y is the second sample image, whose features provide the style characteristics, and ŷ is the target sample image.
And thirdly, calculating target loss according to the content loss and the style loss.
Further, the computer device fuses the content loss and the style loss to obtain the target loss, which is calculated as:
L = α · L_content + β · L_style
where α and β control the weights of the content loss and the style loss in the target loss, respectively; the larger α is, the more prominent the content features in the target sample image, and the larger β is, the more prominent the style features of the target sample image.
Step 1106, training the coding network and the decoding network according to the target loss.
In one possible implementation, the computer device adjusts network parameters in the encoding network and the decoding network according to the target loss through a back propagation algorithm (or a gradient descent algorithm), and stops training when the target loss satisfies a convergence condition. The embodiment of the present application does not limit the specific way of training the network according to the target loss.
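One training iteration might be sketched as follows, reusing the fusion and loss sketches above; the Adam optimizer, the single-layer style loss, and the loss weights are simplifying assumptions rather than details taken from the embodiment, and the encoder and decoder are assumed to be tf.keras models.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(encoder, decoder, first_sample, second_sample, alpha=1.0, beta=10.0):
    """One update of the encoding and decoding networks from a pair of sample partial images."""
    with tf.GradientTape() as tape:
        c_feat = encoder(first_sample)                 # first sample feature
        s_feat = encoder(second_sample)                # second sample feature
        t_feat = fuse_features(c_feat, s_feat)         # target sample feature
        target_image = decoder(t_feat)                 # target sample image
        t_out = encoder(target_image)                  # re-encode the target sample image for the losses
        loss = alpha * content_loss(t_out, c_feat) + beta * style_loss([t_out], [s_feat])
    variables = encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, variables)             # back propagation of the target loss
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```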
Referring to fig. 12, a block diagram of an image processing apparatus according to an embodiment of the present application is shown. The apparatus may be implemented as all or part of the computer device shown in fig. 1 by software, hardware, or a combination of both. The device includes:
an obtaining module 1201, configured to obtain an image to be processed and a style image, where the style of a target face part in the image to be processed is different from that of the target face part in the style image;
an extracting module 1202, configured to extract a first partial image from the image to be processed, and extract a second partial image from the style image, where the first partial image and the second partial image include images of the target face region;
a style migration module 1203, configured to perform style migration on the first local image according to the second local image to obtain a target local image, where the style of the target face part in the target local image is the same as that of the target face part in the second local image;
a generating module 1204, configured to fuse the target local image and the image to be processed, and generate a target image.
Optionally, the style migration module 1203 includes:
a first feature extraction unit, configured to input the first local image into a coding network, so as to obtain a first image feature output by the coding network, where the coding network is configured to perform feature extraction on an input image;
the second characteristic extraction unit is used for inputting the second local image into the coding network to obtain a second image characteristic output by the coding network;
the first feature fusion unit is used for performing feature fusion on the first image feature and the second image feature to obtain a target image feature;
and the first restoring unit is used for inputting the target image characteristics into a decoding network to obtain the target local image output by the decoding network, and the decoding network is used for restoring the image according to the input image characteristics.
Optionally, the encoding network is based on SqueezeNet, and the decoding network and the encoding network are mirror structures;
the coding network comprises a first convolution layer, a maximum pooling layer and n Fire layers, each Fire layer comprises a first extrusion module, a first expansion module and a second expansion module, the first extrusion module is used for feature dimension reduction, the first expansion module and the second expansion module are used for dimension increase of features output by the first extrusion module, and n is a positive integer less than or equal to 4;
the encoding network comprises n Fire transpose layers, an up-sampling layer and a second convolution layer, each Fire transpose layer comprises a second extrusion module, a third extrusion module and a third expansion module, the second extrusion module and the third extrusion module are used for feature dimension reduction, and the third expansion module is used for dimension enhancement of features output by the second extrusion module and the third extrusion module.
Optionally, the first feature fusion unit is configured to:
constructing a first mean vector and a first standard deviation vector according to a first feature mean and a first feature standard deviation corresponding to each feature channel in the first image feature;
constructing a second mean vector and a second standard deviation vector according to a second feature mean and a second feature standard deviation corresponding to each feature channel in the second image feature;
and generating the target image feature according to the first image feature, the first mean vector, the first standard deviation vector, the second mean vector and the second standard deviation vector.
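The fusion described above, which uses per-channel means and standard deviations of both features, resembles adaptive instance normalization (AdaIN); the sketch below adopts that interpretation, which is an assumption rather than a statement of the claimed method.

```python
import torch

def fuse_features(first_feat: torch.Tensor, second_feat: torch.Tensor,
                  eps: float = 1e-5) -> torch.Tensor:
    """first_feat, second_feat: (N, C, H, W). Per channel, normalize the
    first (content) feature with its own mean/std and re-scale it with the
    second (style) feature's mean/std."""
    mu1 = first_feat.mean(dim=(2, 3), keepdim=True)          # first mean vector
    std1 = first_feat.std(dim=(2, 3), keepdim=True) + eps    # first std vector
    mu2 = second_feat.mean(dim=(2, 3), keepdim=True)         # second mean vector
    std2 = second_feat.std(dim=(2, 3), keepdim=True) + eps   # second std vector
    return std2 * (first_feat - mu1) / std1 + mu2
```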
Optionally, the extracting module 1202 includes:
a first image extraction unit, configured to input the image to be processed into a facial feature segmentation network to obtain a first gray matrix output by the facial feature segmentation network, where the first gray matrix is used to indicate the position of the target face part in the image to be processed; crop the first gray matrix to obtain a first cropping matrix, where the first cropping matrix represents a minimum rectangular area containing the target face part; and generate the first local image according to the first cropping matrix and the image to be processed;
a second image extraction unit, configured to input the style image into the facial feature segmentation network to obtain a second gray matrix output by the facial feature segmentation network, where the second gray matrix is used to indicate the position of the target face part in the style image; crop the second gray matrix to obtain a second cropping matrix, where the second cropping matrix represents a minimum rectangular area containing the target face part; and generate the second local image according to the second cropping matrix and the style image.
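A minimal sketch of turning a gray matrix into the minimum enclosing rectangle and cropping the corresponding local image is shown below. The assumption that non-zero values in the gray matrix mark the target face part, and the (top, left, height, width) box layout, are introduced here for illustration.

```python
import numpy as np

def crop_from_mask(image: np.ndarray, gray_matrix: np.ndarray):
    """gray_matrix: HxW array whose non-zero entries mark the target face
    part. Returns the minimal enclosing rectangle and the cropped patch."""
    ys, xs = np.nonzero(gray_matrix)
    if ys.size == 0:
        return None, None                  # target part not found
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    box = (top, left, bottom - top, right - left)
    return box, image[top:bottom, left:right]
```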
Optionally, the apparatus further comprises:
a padding module, configured to perform mirror padding on the first local image and the second local image, where the mirror padding is used to eliminate artificial false edges in the images.
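A minimal sketch of such mirror padding, using NumPy's reflect mode on an HxWx3 patch, is shown below; the padding width is an arbitrary illustrative value.

```python
import numpy as np

def mirror_pad(patch: np.ndarray, pad: int = 16) -> np.ndarray:
    """Reflect-pad the cropped HxWx3 patch on all four sides so that later
    convolutions do not create artificial edges at the crop boundary."""
    return np.pad(patch, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
```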
Optionally, the apparatus further comprises:
a sample acquisition module, configured to acquire a first sample image and a second sample image, where the first sample image and the second sample image include images of the same face part, and the styles of the face part in the first sample image and the second sample image are different;
a sample feature extraction module, configured to input the first sample image and the second sample image into the coding network to obtain a first sample feature and a second sample feature output by the coding network;
a sample feature fusion module, configured to perform feature fusion on the first sample feature and the second sample feature to obtain a target sample feature;
a sample restoration module, configured to input the target sample feature into the decoding network to obtain a target sample image output by the decoding network, where the target sample image is obtained by performing style migration on the first sample image according to the style of the second sample image;
a loss calculation module for calculating a target loss from the target sample image, the first sample image and the second sample image;
a training module to train the coding network and the decoding network according to the target loss.
Optionally, the loss calculating module includes:
a first calculation unit, configured to calculate a content loss from the target sample image and the first sample image, the content loss indicating a content difference between the target sample image and the first sample image;
a second calculation unit configured to calculate a style loss from the target sample image and the second sample image, the style loss indicating a style difference between the target sample image and the second sample image;
a third calculating unit, configured to calculate the target loss according to the content loss and the style loss.
Optionally, the first computing unit is configured to:
inputting the target sample image and the first sample image into the coding network to obtain a feature map output by an output layer in the coding network; calculating the content loss according to the feature maps corresponding to the target sample image and the first sample image respectively;
the second computing unit is configured to:
inputting the target sample image and the second sample image into the coding network to obtain a feature map output by at least two feature extraction layers in the coding network, wherein the at least two feature extraction layers comprise an output layer and at least one middle layer; and calculating the style loss according to the feature maps corresponding to the target sample image and the second sample image.
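The following sketch illustrates one way the content loss, style loss and target loss could be computed from encoder feature maps. Measuring the style difference through per-channel means and standard deviations, and the relative weighting of the two losses, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def content_loss(feat_target_out: torch.Tensor, feat_first_out: torch.Tensor) -> torch.Tensor:
    """MSE between the output-layer feature maps of the target sample image
    and the first sample image (assumed distance measure)."""
    return F.mse_loss(feat_target_out, feat_first_out)

def style_loss(feats_target: list, feats_second: list) -> torch.Tensor:
    """feats_*: feature maps from at least two encoder layers (output layer
    plus intermediate layers). Per-channel mean/std statistics are compared
    layer by layer; using these statistics instead of, e.g., Gram matrices
    is an assumption of this sketch."""
    loss = 0.0
    for ft, fs in zip(feats_target, feats_second):
        loss = loss + F.mse_loss(ft.mean(dim=(2, 3)), fs.mean(dim=(2, 3)))
        loss = loss + F.mse_loss(ft.std(dim=(2, 3)), fs.std(dim=(2, 3)))
    return loss

def target_loss(c_loss: torch.Tensor, s_loss: torch.Tensor,
                style_weight: float = 10.0) -> torch.Tensor:
    # weighted combination; the weight is illustrative
    return c_loss + style_weight * s_loss
```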
To sum up, in the embodiment of the present application, after the image to be processed and the style image are acquired, a first local image and a second local image containing the target face part are extracted from the image to be processed and the style image respectively, style migration is performed on the first local image according to the second local image to obtain a target local image, and the target local image is finally fused with the image to be processed to generate a target image. Compared with the related art, in which only beautification parameters can be adjusted, the scheme provided by the embodiment of the present application can apply the style of a face part in another image to the image to be processed, which enriches the available beautification effects and realizes customized beautification.
In this embodiment, the computer device uses the facial feature segmentation network to extract the local images, so that the face does not need to be divided manually, which improves both the extraction efficiency and the extraction accuracy; by cropping the gray matrix and extracting the local image with the cropped matrix, the amount of calculation during local image extraction can be reduced.
In addition, in this embodiment, the computer device performs feature extraction on the local images by using a SqueezeNet-based encoding network and fuses the extracted content features and style features, so that the decoding network can restore the fused image features into the style-migrated local image.
The embodiment of the present application further provides a computer-readable storage medium storing at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the image processing method according to the above embodiments.
The embodiment of the present application further provides a computer program product storing at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the image processing method according to the above embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed and a style image, wherein the style of a target face part in the image to be processed is different from that of the target face part in the style image;
extracting a first local image from the image to be processed and extracting a second local image from the style image, wherein the first local image and the second local image comprise images of the target face part;
performing style migration on the first local image according to the second local image to obtain a target local image, wherein the style of the target face part in the target local image is the same as that of the target face part in the second local image;
and fusing the target local image and the image to be processed to generate a target image.
2. The method of claim 1, wherein performing style migration on the first local image according to the second local image to obtain a target local image comprises:
inputting the first local image into a coding network to obtain a first image feature output by the coding network, wherein the coding network is used for extracting the feature of the input image;
inputting the second local image into the coding network to obtain a second image characteristic output by the coding network;
performing feature fusion on the first image feature and the second image feature to obtain a target image feature;
and inputting the target image characteristics into a decoding network to obtain the target local image output by the decoding network, wherein the decoding network is used for carrying out image restoration according to the input image characteristics.
3. The method of claim 2, wherein the encoding network is based on SqueezeNet, and the decoding network is a mirror structure of the encoding network;
the encoding network comprises a first convolution layer, a maximum pooling layer and n Fire layers, each Fire layer comprises a first squeeze module, a first expansion module and a second expansion module, the first squeeze module is used for feature dimension reduction, the first expansion module and the second expansion module are used for increasing the dimension of the features output by the first squeeze module, and n is a positive integer less than or equal to 4;
the decoding network comprises n Fire transpose layers, an upsampling layer and a second convolution layer, each Fire transpose layer comprises a second squeeze module, a third squeeze module and a third expansion module, the second squeeze module and the third squeeze module are used for feature dimension reduction, and the third expansion module is used for increasing the dimension of the features output by the second squeeze module and the third squeeze module.
4. The method of claim 2, wherein the performing feature fusion on the first image feature and the second image feature to obtain a target image feature comprises:
constructing a first mean vector and a first standard deviation vector according to a first feature mean and a first feature standard deviation corresponding to each feature channel in the first image feature;
constructing a second mean vector and a second standard deviation vector according to a second feature mean and a second feature standard deviation corresponding to each feature channel in the second image feature;
and generating the target image feature according to the first image feature, the first mean vector, the first standard deviation vector, the second mean vector and the second standard deviation vector.
5. The method according to any one of claims 1 to 4, wherein the extracting a first local image from the image to be processed and extracting a second local image from the style image comprises:
inputting the image to be processed into a facial feature segmentation network to obtain a first gray matrix output by the facial feature segmentation network, wherein the first gray matrix is used for indicating the position of the target face part in the image to be processed; cropping the first gray matrix to obtain a first cropping matrix, wherein the first cropping matrix represents a minimum rectangular area containing the target face part; and generating the first local image according to the first cropping matrix and the image to be processed;
inputting the style image into the facial feature segmentation network to obtain a second gray matrix output by the facial feature segmentation network, wherein the second gray matrix is used for indicating the position of the target face part in the style image; cropping the second gray matrix to obtain a second cropping matrix, wherein the second cropping matrix represents a minimum rectangular area containing the target face part; and generating the second local image according to the second cropping matrix and the style image.
6. The method of claim 5, wherein after extracting the first local image from the image to be processed and extracting the second local image from the style image, the method further comprises:
performing mirror padding on the first local image and the second local image, wherein the mirror padding is used for eliminating artificial false edges in the images.
7. The method of any of claims 2 to 4, further comprising:
acquiring a first sample image and a second sample image, wherein the first sample image and the second sample image comprise images of the same face part, and the styles of the face part in the first sample image and the second sample image are different;
inputting the first sample image and the second sample image into the coding network to obtain a first sample characteristic and a second sample characteristic output by the coding network;
performing feature fusion on the first sample feature and the second sample feature to obtain a target sample feature;
inputting the target sample characteristics into the decoding network to obtain the target sample image output by the decoding network, wherein the target sample image is obtained by performing style migration on the first sample image according to the style of the second sample image;
calculating a target loss from the target sample image, the first sample image and the second sample image;
training the encoding network and the decoding network according to the target loss.
8. The method of claim 7, wherein calculating a target loss from the target sample image, the first sample image, and the second sample image comprises:
calculating a content loss from the target sample image and the first sample image, the content loss indicating a content difference between the target sample image and the first sample image;
calculating a style loss from the target sample image and the second sample image, the style loss indicating a style difference between the target sample image and the second sample image;
calculating the target loss according to the content loss and the style loss.
9. The method of claim 8, wherein calculating a content loss from the target sample image and the first sample image comprises:
inputting the target sample image and the first sample image into the coding network to obtain a feature map output by an output layer in the coding network; calculating the content loss according to the feature maps corresponding to the target sample image and the first sample image respectively;
said calculating a style loss from said target sample image and said second sample image, comprising:
inputting the target sample image and the second sample image into the coding network to obtain a feature map output by at least two feature extraction layers in the coding network, wherein the at least two feature extraction layers comprise an output layer and at least one middle layer; and calculating the style loss according to the feature maps corresponding to the target sample image and the second sample image.
10. An image processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire an image to be processed and a style image, wherein the style of a target face part in the image to be processed is different from that of the target face part in the style image;
an extraction module, configured to extract a first local image from the image to be processed and extract a second local image from the style image, wherein the first local image and the second local image comprise images of the target face part;
a style migration module, configured to perform style migration on the first local image according to the second local image to obtain a target local image, wherein the style of the target face part in the target local image is the same as that of the target face part in the second local image;
and a generating module, configured to fuse the target local image with the image to be processed to generate a target image.
11. A computer device, wherein the computer device comprises a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the image processing method of any of claims 1 to 9.
12. A computer-readable storage medium having stored thereon at least one instruction for execution by a processor to implement the image processing method of any one of claims 1 to 9.




