
CN111161200A - Human body posture migration method based on attention mechanism - Google Patents


Info

Publication number
CN111161200A
Authority
CN
China
Prior art keywords
image
discriminator
posture
attention
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911332748.0A
Other languages
Chinese (zh)
Inventor
李坤
张劲松
杨敬钰
赵宇阳
刘烨斌
戴琼海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201911332748.0A
Publication of CN111161200A
Legal status: Pending

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of image synthesis, and aims to realize pose-guided image synthesis while enhancing both the sharpness of the generated image and its conformity with the target pose. The technical scheme adopted by the invention is a human body posture migration method based on an attention mechanism, comprising the following steps: an image preprocessing step: forming training data; attention coding under pose guidance; network building and training: a generative adversarial network model is adopted, divided into a generator and a discriminator; the generated image is fed into the discriminator, which forces the generator to produce pictures closer to reality by distinguishing real images from generated images; finally, the trained generative adversarial network is used to complete human pose migration. The invention is mainly applied to image processing occasions.

Description

Human body posture migration method based on attention mechanism
Technical Field
The invention belongs to the field of image synthesis, and particularly relates to an image synthesis technique for human body posture migration based on an attention mechanism, in particular to a human body posture migration method based on an attention mechanism.
Background
Human pose migration is the generation of an image of a specific person performing a specified pose. It can be used to generate data sets for tasks such as pedestrian re-identification, so that those tasks can be solved in a data-driven fashion. In view of its importance, more and more researchers have begun to focus on the human pose migration task. Unlike unconditional image synthesis, human pose migration is a conditional image synthesis task: given an image containing a person and a fixed pose, the goal is to generate an image of that person performing the specified pose.
Most existing human pose migration methods adopt an encoder-decoder structure and, under the guidance of the input image and a target two-dimensional pose given as human joint points, learn the transformation from the input image to the target pose. Mainstream human pose transfer techniques fall mainly into two categories: conditional variational autoencoders and conditional generative adversarial networks. Conditional variational autoencoders can express the transformation relations between poses well, but the pictures they generate are not sharp enough. Conditional generative adversarial networks can produce sharper pictures, but they cannot handle well the pixel misalignment caused by pose changes, so they perform poorly on images with more complex poses.
Disclosure of Invention
In order to overcome the defects of the prior art, the aims of the invention are:
1) Aiming at the pixel misalignment caused by pose migration, which existing methods handle poorly, the invention uses an attention mechanism to redesign the interior of the image generator so as to realize pose-guided image synthesis.
2) In order to fully utilize the image information and generate sharp pictures, the invention adopts a generative adversarial network framework and simultaneously enhances the sharpness of the generated image and its conformity with the target pose.
The technical scheme adopted by the invention is a human body posture migration method based on an attention mechanism, comprising the following steps:
An image preprocessing step: forming training data;
Attention coding under pose guidance: for an image feature C_I and a pose feature C_P, the pose feature guides the transformation of the image feature using a self-attention mechanism, obtaining the attention code under pose guidance;
Network building and training: a generative adversarial network model is adopted, divided into a generator and a discriminator; the generator first encodes the picture into high-dimensional image features with a down-sampling convolution module, then performs attention coding under pose guidance, completing the transformation of the image features through multiple rounds of coding, and finally converts the image features back into a picture through an up-sampling convolution module; the generated image is fed into the discriminator, which forces the generator to produce pictures closer to reality by distinguishing real images from generated images; finally, the trained generative adversarial network is used to complete human pose migration.
The image preprocessing comprises the following specific steps: first, the poses of each person are extracted using a trained joint-point detector (HPE); then images of the same person and their corresponding poses are grouped together, and the pictures within each group are permuted and combined to form the training data. For the reference data set Market-1501, 263632 groups of training data and 12000 groups of test data are collected; for the DeepFashion data set, 101966 groups of training data and 8570 groups of test data are collected.
The attention coding under pose guidance comprises the following specific steps: first, the pose feature is mapped into a Key and a Value respectively through 1x1 convolutions, where the Key and the Value represent the information of the pose feature and correspond to each other one-to-one; then the transposed Key is multiplied with the Value to obtain an attention map; finally, the image feature is combined with the attention map to obtain the attention code under pose guidance.
After the attention code is obtained, the image feature and the pose feature are concatenated for better integration; having received feedback from the image feature, the pose feature can further guide the image feature through the subsequent transformations.
The input of the generator is a conditional image I_c, the pose P_c corresponding to the conditional image, and the target pose P_t; the output is a generated image I_g. After the image is generated, it is fed into the discriminator. The discriminator takes the form of a dual discriminator: a texture discriminator D_A and a shape discriminator D_S. The texture discriminator D_A takes the generated image I_g and the conditional image I_c to judge whether the textures of the two images are consistent; its inputs are (I_c, I_t) and (I_c, I_g), i.e. two-tuples of the conditional image with the target image or with the generated image, respectively. The shape discriminator D_S takes a generated image and the target pose to judge whether the generated image conforms to the target pose; its inputs are (P_t, I_t) and (P_t, I_g), i.e. two-tuples of the target pose with the target image or with the generated image, respectively.
The loss function of the generative adversarial network model comprises three parts:
1) The adversarial loss L_CGAN of the generative adversarial network. This loss constrains the relationship between the generator and the discriminators so that the two stay balanced; corresponding to the two discriminators, it contains two adversarial terms, and the total adversarial loss is defined as follows:
L_CGAN = E_{P_t~p_P, I_t~p_I}[ log( D_A(I_c, I_t) * D_S(P_t, I_t) ) ] + E_{P_t~p_P, I_g~p_g}[ log( (1 - D_A(I_c, I_g)) * (1 - D_S(P_t, I_g)) ) ]    (1)
where p_P, p_I and p_g denote the distribution of human poses, the distribution of real images and the distribution of generated images, respectively;
2) The distance loss L_L1. This loss is the pixel-wise distance between the generated image and the target image; minimizing it brings the generated image closer to the target image. It is defined as follows:
L_L1 = ‖ I_g - I_t ‖_1,    (2)
3) The perceptual loss L_percep. The perceptual loss is used to reduce the structural difference between the generated image and the target image and to make the generated image more natural. It is defined as follows:
L_percep = (1 / N_j) * Σ_{i=1..N_j} ‖ φ_j^i(I_g) - φ_j^i(I_t) ‖_1,    (3)
where φ_j denotes the output of the j-th layer of a VGG-19 network pre-trained on the ImageNet data set, φ_j^i denotes the i-th feature map in that layer's output, and N_j is the number of feature maps in that output.
The final overall loss function is shown in equation (4):
L_full = α·L_CGAN + β·L_L1 + γ·L_percep,    (4)
where α, β and γ denote the weights of the three parts L_CGAN, L_L1 and L_percep, respectively.
The invention has the characteristics and beneficial effects that:
the invention provides an attention mechanism-based image synthesis system for human body posture migration. Given a picture containing a person and an arbitrary pose, the system can generate a picture of the person making the specified pose. The system introduces an attention mechanism, changes the attention mechanism into an attention mechanism more suitable for posture guidance of the task, and solves the problem of picture pixel misalignment caused in the posture migration process. And meanwhile, the optimal result is obtained on both the Market-1501 data set and the DeepFashinon data set.
Description of the drawings:
FIG. 1 is a system block diagram of a human pose migration technique based on an attention mechanism.
FIG. 2 shows results generated for arbitrary poses on the Market-1501 data set.
FIG. 3 shows results generated for arbitrary poses on the DeepFashion data set.
FIG. 4 is a qualitative comparison of the present system with the four best current algorithms for this task.
Detailed Description
In order to solve the problems in the prior art, the invention provides an image generation method closer to the human way of thinking, in which the pose guides the synthesis of image pixels through an attention mechanism. Previous methods mostly adopt human body segmentation: the body is divided into several parts, a rigid transformation is applied to each part, and the final result is then stitched together. Such methods handle cases where the difference between the conditional pose and the target pose is small, but the pixel misalignment caused by pose transformation becomes prominent when the difference is large. To solve this pixel misalignment, the invention lets the pose features guide the image features based on an attention mechanism, gradually transforming the image features from the initial pose to the specified pose and thereby gradually resolving the misalignment. Meanwhile, by using a generative adversarial network framework, the invention can generate sufficiently sharp pictures.
The invention provides a human body posture migration technique based on an attention mechanism. The technical scheme takes the Market-1501 data set and the DeepFashion data set as processing objects, and the whole system comprises three parts: data preprocessing, attention coding under pose guidance, and network building and training. In order to better complete the human pose migration task and generate pictures that meet the requirements, network design and network training are the two main problems to be solved. The specific technical scheme is as follows:
step one, preprocessing image data:
for the pictures in the two data sets, firstly, the Pose of a person is extracted by using an HPE (Human position Estimation) joint point detector, then, the fixed person and the corresponding Pose are divided into a group, and the pictures in each group are arranged and combined to form training data. For a Market-1501 data set (a reference data set for pedestrian re-recognition work and a reference data set for human posture migration work), 263632 groups of training data and 12000 groups of test data are collected; for the DeepFashinon dataset (containing 80 million pictures, containing different angles, different scenes, buyer show, etc.), we collected 101966 sets of training data and 8570 sets of test data.
Step two, encoding attention under the guidance of the posture:
for image feature CIAnd an attitude feature CPThe invention leads the posture characteristic to guide the image characteristic transformation by transforming a self-attention mechanism. Firstly, mapping the attitude characteristics into Key and Value respectively through convolution of 1 x 1, wherein the Key and the Value represent information of the attitude characteristics and are in one-to-one correspondence; then multiplying the translated Key with Value to obtain an attention diagram; and finally, the image features and the attention map are combined to obtain the attention code under the guidance of the posture.
After the attention coding is obtained, the image feature and the posture feature are spliced for better integration. After feedback of the image features is obtained, the pose features may further guide the image features for subsequent transformation.
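The data flow of one pose-guided attention coding module, as read from the description above, can be sketched in PyTorch as follows. The tensor shapes, the softmax placement and the residual combination with the image feature are illustrative assumptions rather than the patent's exact implementation; the concatenation with the pose feature mentioned above would happen outside this module.

```python
# Sketch of a pose-guided attention module: Key and Value come from the pose
# feature via 1x1 convolutions, their product forms a spatial attention map,
# and the attention map is applied to the image feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseGuidedAttention(nn.Module):
    def __init__(self, img_channels, pose_channels, inner_channels=None):
        super().__init__()
        inner = inner_channels if inner_channels else pose_channels
        # 1x1 convolutions map the pose feature to Key and Value
        self.to_key = nn.Conv2d(pose_channels, inner, kernel_size=1)
        self.to_value = nn.Conv2d(pose_channels, inner, kernel_size=1)

    def forward(self, img_feat, pose_feat):
        b, c, h, w = img_feat.shape
        key = self.to_key(pose_feat).flatten(2)      # (B, C', H*W)
        value = self.to_value(pose_feat).flatten(2)  # (B, C', H*W)
        # transposed Key multiplied with Value -> spatial attention map
        attn = torch.bmm(key.transpose(1, 2), value)  # (B, H*W, H*W)
        attn = F.softmax(attn, dim=-1)
        # combine the image feature with the attention map
        img_flat = img_feat.flatten(2)                       # (B, C, H*W)
        out = torch.bmm(img_flat, attn).view(b, c, h, w)     # (B, C, H, W)
        return out + img_feat  # residual combination (assumption)
```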
Step three, network building and training:
the invention adopts a framework for generating the countermeasure network, and is divided into a generator and a discriminator. The generator part firstly carries out a downsampling convolution module to code the picture into high-dimensional image characteristics, then uses an attention coding module under the guidance of the posture to complete the conversion of the image characteristics through multiple times of coding, and finally converts the image characteristics into the picture through an upsampling convolution module. The input of the generator is a conditional image IcConditional image corresponding to pose PcAnd target attitude PtOutput as a generated image Ig. After the image is generated, the generated image is put into a discriminator which distinguishes a real image ItAnd generation ofImage IgTo force the generator to generate a picture that is closer to the real one. The invention adopts the form of double discriminators: texture discriminator DAAnd a shape discriminator DS. Texture discriminator DAInput generated image IgAnd a conditional image IcFor judging whether the texture between the two images is consistent, the input is (I)c,It),(Ic,Ig) A doublet of the conditional image and the target image or the generated image, respectively; shape discriminator DSInputting a generated image and a target posture for judging whether the generated image conforms to the target posture, wherein the input is (P)t,It),(Pt,Ig) A target pose and a target image or a binary set of generated images, respectively. The loss function of the network model comprises three parts:
1. generating a loss function L for a countermeasure networkCGAN. The loss function is used to constrain the relationship between the generator and the arbiter to make the two more balanced. Corresponding to the two discriminators, the loss function contains two parts of the opposing loss, the total loss function being defined as follows:
Figure BDA0002330108220000041
wherein
Figure BDA0002330108220000042
Respectively representing the distribution of the human body posture, the distribution of the real image and the distribution of the generated image.
2. The distance loss L_L1. This loss is the pixel-wise distance between the generated image and the target image; minimizing it brings the generated image closer to the target image. It is defined as follows:
L_L1 = ‖ I_g - I_t ‖_1,    (2)
3. The perceptual loss L_percep. The perceptual loss is used to reduce the structural difference between the generated image and the target image and to make the generated image more natural. It is defined as follows:
L_percep = (1 / N_j) * Σ_{i=1..N_j} ‖ φ_j^i(I_g) - φ_j^i(I_t) ‖_1,    (3)
where φ_j denotes the output of the j-th layer of a VGG-19 network pre-trained on the ImageNet data set, φ_j^i denotes the i-th feature map in that layer's output, and N_j is the number of feature maps in that output.
The final overall loss function is shown in equation (4), and a code sketch of the three terms follows below:
L_full = α·L_CGAN + β·L_L1 + γ·L_percep,    (4)
where α, β and γ denote the weights of the three parts L_CGAN, L_L1 and L_percep, respectively.
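A hedged PyTorch sketch of the three terms and the overall loss of equation (4) is given below. The binary-cross-entropy formulation of the adversarial term, the choice of VGG layer (passed in as `vgg_layer`) and all variable names are illustrative assumptions; only the structure (two discriminators, an L1 term and a perceptual term combined with weights α, β, γ) follows the text above.

```python
# Sketch of L_CGAN, L_L1, L_percep and their weighted combination L_full.
import torch
import torch.nn.functional as F

def adversarial_loss(d_a_real, d_a_fake, d_s_real, d_s_fake):
    # L_CGAN (discriminator view): real pairs should score 1, generated pairs 0
    bce = F.binary_cross_entropy_with_logits
    return (bce(d_a_real, torch.ones_like(d_a_real))
            + bce(d_a_fake, torch.zeros_like(d_a_fake))
            + bce(d_s_real, torch.ones_like(d_s_real))
            + bce(d_s_fake, torch.zeros_like(d_s_fake)))

def l1_loss(i_g, i_t):
    # L_L1 = || I_g - I_t ||_1, equation (2)
    return (i_g - i_t).abs().mean()

def perceptual_loss(i_g, i_t, vgg_layer):
    # L_percep, equation (3): L1 distance between VGG-19 feature maps
    return (vgg_layer(i_g) - vgg_layer(i_t)).abs().mean()

def full_loss(i_g, i_t, d_outputs, vgg_layer, alpha, beta, gamma):
    # L_full = alpha * L_CGAN + beta * L_L1 + gamma * L_percep, equation (4)
    return (alpha * adversarial_loss(*d_outputs)
            + beta * l1_loss(i_g, i_t)
            + gamma * perceptual_loss(i_g, i_t, vgg_layer))
```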
The present invention will be described in further detail below with reference to the accompanying drawings and specific experiments.
Fig. 1 is a system framework diagram of a human body posture migration technique based on an attention mechanism according to the present invention, which mainly includes the following steps:
step one, preprocessing image data:
from each group of pictures in the Market-1501 data set and the DeepFashinon data set, the postures of the pictures are extracted by the joint point detector, and the input and output paired images shown in the figure 1 are formed by combining two pictures of the same person with different postures. For the Market-1501 data set, we collected 263632 sets of training data and 12000 sets of test data; for the DeepFashinon dataset, we collected 101966 sets of training data and 8570 sets of test data.
Step two, encoding attention under the guidance of the posture:
the structure of the generator of the system is shown in FIG. 1, and the attention coding module under each posture guidance is shown in the small diagram at the lower right corner in FIG. 1. The generator comprises two encoders and a decoder, wherein the two encoders respectively apply the conditional image IcConditional image corresponding to pose PcAnd target attitude PtThe concatenation is used as input. Both encoders have the same structure, i.e. the downsampled convolutional layer, and the decoder is the upsampled convolutional layer. And the image characteristics are migrated through the attention coding module under the posture guidance provided by the invention. The input to each module is an image feature and a pose feature. For example, the input to the tth module is an image feature
Figure BDA0002330108220000051
And attitude characteristics
Figure BDA0002330108220000052
Outputting transformed image features after passing through a module
Figure BDA0002330108220000053
And attitude characteristics
Figure BDA0002330108220000054
After the last module, only the transformed image features are required
Figure BDA0002330108220000055
The final image is generated in the decoder. All experimental results in the invention are tested when T is 6, namely 6 gesture-guided attention coding modules are available.
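The overall forward pass just described can be sketched as follows. The block interface (each module returning a transformed image feature and pose feature) and the concatenation of P_c with P_t at the pose encoder are assumptions made for illustration, not the patent's exact architecture; the encoders, blocks and decoder are passed in as stand-in modules.

```python
# Structural sketch of the generator: two down-sampling encoders, T
# pose-guided attention coding modules, and an up-sampling decoder.
import torch
import torch.nn as nn

class PoseTransferGenerator(nn.Module):
    def __init__(self, image_encoder, pose_encoder, attention_blocks, decoder):
        super().__init__()
        self.image_encoder = image_encoder          # down-sampling encoder for I_c
        self.pose_encoder = pose_encoder            # down-sampling encoder for [P_c; P_t]
        self.attention_blocks = nn.ModuleList(attention_blocks)  # T = 6 in the experiments
        self.decoder = decoder                      # up-sampling decoder

    def forward(self, i_c, p_c, p_t):
        img_feat = self.image_encoder(i_c)
        pose_feat = self.pose_encoder(torch.cat([p_c, p_t], dim=1))
        for block in self.attention_blocks:
            # each module outputs transformed features C_I^(t+1), C_P^(t+1)
            img_feat, pose_feat = block(img_feat, pose_feat)
        # only the final image feature C_I^(T+1) is decoded into the output image I_g
        return self.decoder(img_feat)
```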
Step three, network building and training:
the constructed network comprises a generator and two discriminators. The structure of the discriminator is a common Convolutional Neural Network (CNN). For the texture discriminator, the input at each time is the condition image and the target image (I)c,It) And conditional image and generated image (I)c,Ig) The score is output as a score for judging the consistency of the texture. For the shape discriminator, the input at each time is the target image and the target pose (I)t,Pt) And generating an image and a target pose (I)g,Pt) And outputting a score as a score for judging the gesture consistency.
During training, approximately 90,000 iterations are performed using the Adam optimizer. The learning rate is initially set to 2 x 10^-4 and is linearly decayed to 0 after 60,000 iterations. For both data sets, 6 pose-guided attention coding modules are used, with slightly different settings of the hyper-parameters α and γ: 5 and 10 respectively on Market-1501, and 5 and 1 respectively on the DeepFashion data set.
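One possible realization of this schedule in PyTorch is sketched below. The Adam betas and the stand-in model are assumptions; the 2 x 10^-4 initial rate, the linear decay to zero after 60,000 iterations and the roughly 90,000 total iterations follow the text above, with the generator/discriminator alternation omitted.

```python
# Sketch of the training schedule: Adam with a linearly decaying learning rate.
import torch

TOTAL_ITERS = 90_000   # approximate total number of iterations
DECAY_START = 60_000   # learning rate held constant until here, then decays linearly

model = torch.nn.Linear(1, 1)  # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

def lr_factor(iteration):
    if iteration < DECAY_START:
        return 1.0
    # linear decay from 1 to 0 over the remaining iterations
    return max(0.0, 1.0 - (iteration - DECAY_START) / (TOTAL_ITERS - DECAY_START))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for it in range(TOTAL_ITERS):
    # forward pass, loss computation and optimizer.step() would go here
    scheduler.step()
```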
FIG. 4 presents a qualitative comparison of the results of the present system with the four methods that currently perform best on this task, where PG2, VUnet, Deform and PATN are methods published at the top conferences NIPS 2017, CVPR 2018 and CVPR 2019, respectively. It can be seen that the system produces sharper pictures and handles samples with large pose changes well. At the same time, the pictures generated by the system preserve the texture information of the conditional image and retain good facial information.
Table 1 shows a quantitative comparison of the results of the present system with the four methods currently performing optimally on this task.
TABLE 1. Quantitative comparison of the present system with the four best current algorithms for this task
In Table 1, SSIM is the structural similarity index, which measures the structural similarity between two pictures. Because the Market-1501 data set contains a variety of complex backgrounds, mask-SSIM (SSIM computed within a foreground mask) is adopted as the metric on that data set. IS is the Inception Score, i.e. the score derived from a pre-trained Inception network, used to measure the quality of the pictures synthesized by the generative network. It can be seen that the present system achieves the best performance to date on the human pose migration task.
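For reference, SSIM and mask-SSIM between a generated and a target image could be computed roughly as sketched below with scikit-image; the masking step is an illustrative assumption about how the background is excluded, not the patent's evaluation code.

```python
# Sketch of SSIM and mask-SSIM evaluation for one image pair.
import numpy as np
from skimage.metrics import structural_similarity

def ssim_score(generated, target):
    # generated, target: H x W x 3 uint8 arrays
    return structural_similarity(generated, target, channel_axis=-1)

def mask_ssim_score(generated, target, mask):
    # mask-SSIM: compare only the masked (foreground) region, so the cluttered
    # Market-1501 backgrounds do not dominate the score
    mask3 = np.repeat(mask.astype(generated.dtype)[..., None], 3, axis=-1)
    return structural_similarity(generated * mask3, target * mask3, channel_axis=-1)
```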

Claims (6)

1. A human body posture migration method based on an attention mechanism, characterized by comprising the following steps:
an image preprocessing step: forming training data;
attention coding under pose guidance: for an image feature C_I and a pose feature C_P, the pose feature guides the transformation of the image feature using a self-attention mechanism, obtaining the attention code under pose guidance;
network building and training: a generative adversarial network model is adopted, divided into a generator and a discriminator; the generator first encodes the picture into high-dimensional image features with a down-sampling convolution module, then performs attention coding under pose guidance, completing the transformation of the image features through multiple rounds of coding, and finally converts the image features back into a picture through an up-sampling convolution module; the generated image is fed into the discriminator, which forces the generator to produce pictures closer to reality by distinguishing real images from generated images; finally, the trained generative adversarial network is used to complete human pose migration.
2. The human body posture migration method based on an attention mechanism according to claim 1, characterized in that the image preprocessing comprises the following specific steps: first, the poses of each person are extracted using a trained joint-point detector (HPE); then images of the same person and their corresponding poses are grouped together, and the pictures within each group are permuted and combined to form the training data; for the reference data set Market-1501, 263632 groups of training data and 12000 groups of test data are collected; for the DeepFashion data set, 101966 groups of training data and 8570 groups of test data are collected.
3. The human body posture migration method based on an attention mechanism according to claim 1, characterized in that the attention coding under pose guidance comprises the following steps: first, the pose feature is mapped into a Key and a Value respectively through 1x1 convolutions, where the Key and the Value represent the information of the pose feature and correspond to each other one-to-one; then the transposed Key is multiplied with the Value to obtain an attention map; finally, the image feature is combined with the attention map to obtain the attention code under pose guidance; after the attention code is obtained, the image feature and the pose feature are concatenated for better integration, and having received feedback from the image feature, the pose feature can further guide the image feature through the subsequent transformations.
4. The method according to claim 1, characterized in that the input of the generator is a conditional image I_c, the pose P_c corresponding to the conditional image, and the target pose P_t, and the output is a generated image I_g; after the image is generated, it is fed into the discriminator; the discriminator takes the form of a dual discriminator: a texture discriminator D_A and a shape discriminator D_S; the texture discriminator D_A takes the generated image I_g and the conditional image I_c to judge whether the textures of the two images are consistent, its inputs being (I_c, I_t) and (I_c, I_g), i.e. two-tuples of the conditional image with the target image or with the generated image, respectively; the shape discriminator D_S takes a generated image and the target pose to judge whether the generated image conforms to the target pose, its inputs being (P_t, I_t) and (P_t, I_g), i.e. two-tuples of the target pose with the target image or with the generated image, respectively.
5. The human body posture migration method based on an attention mechanism according to claim 4, characterized in that the loss function of the generative adversarial network model comprises three parts:
1) the adversarial loss L_CGAN of the generative adversarial network, which constrains the relationship between the generator and the discriminators so that the two stay balanced; corresponding to the two discriminators, it contains two adversarial terms, and the total adversarial loss is defined as follows:
L_CGAN = E_{P_t~p_P, I_t~p_I}[ log( D_A(I_c, I_t) * D_S(P_t, I_t) ) ] + E_{P_t~p_P, I_g~p_g}[ log( (1 - D_A(I_c, I_g)) * (1 - D_S(P_t, I_g)) ) ]    (1)
where p_P, p_I and p_g denote the distribution of human poses, the distribution of real images and the distribution of generated images, respectively;
2) the distance loss L_L1, which is the pixel-wise distance between the generated image and the target image; minimizing it brings the generated image closer to the target image, and it is defined as follows:
L_L1 = ‖ I_g - I_t ‖_1,    (2)
3) the perceptual loss L_percep, which is used to reduce the structural difference between the generated image and the target image and to make the generated image more natural, defined as follows:
L_percep = (1 / N_j) * Σ_{i=1..N_j} ‖ φ_j^i(I_g) - φ_j^i(I_t) ‖_1,    (3)
where φ_j denotes the output of the j-th layer of a VGG-19 network pre-trained on the ImageNet data set, φ_j^i denotes the i-th feature map in that layer's output, and N_j is the number of feature maps in that output.
6. The human body posture migration method based on an attention mechanism according to claim 4, characterized in that the final overall loss function is shown in equation (4):
L_full = α·L_CGAN + β·L_L1 + γ·L_percep,    (4)
where α, β and γ denote the weights of the three parts L_CGAN, L_L1 and L_percep, respectively.
CN201911332748.0A 2019-12-22 2019-12-22 Human body posture migration method based on attention mechanism Pending CN111161200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911332748.0A CN111161200A (en) 2019-12-22 2019-12-22 Human body posture migration method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911332748.0A CN111161200A (en) 2019-12-22 2019-12-22 Human body posture migration method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN111161200A true CN111161200A (en) 2020-05-15

Family

ID=70557725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911332748.0A Pending CN111161200A (en) 2019-12-22 2019-12-22 Human body posture migration method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN111161200A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681182A (en) * 2020-06-04 2020-09-18 Oppo广东移动通信有限公司 Picture restoration method and device, terminal equipment and storage medium
CN111696027A (en) * 2020-05-20 2020-09-22 电子科技大学 Multi-modal image style migration method based on adaptive attention mechanism
CN111739115A (en) * 2020-06-23 2020-10-02 中国科学院自动化研究所 Unsupervised human body posture migration method, system and device based on cycle consistency
CN112116673A (en) * 2020-07-29 2020-12-22 西安交通大学 Virtual human body image generation method and system based on structural similarity under posture guidance and electronic equipment
CN112149645A (en) * 2020-11-10 2020-12-29 西北工业大学 Human body posture key point identification method based on generation of confrontation learning and graph neural network
CN113408351A (en) * 2021-05-18 2021-09-17 河南大学 Pedestrian re-recognition method for generating confrontation network based on attitude guidance
CN113538608A (en) * 2021-01-25 2021-10-22 哈尔滨工业大学(深圳) Controllable character image generation method based on generation countermeasure network
CN113706650A (en) * 2021-08-27 2021-11-26 深圳龙岗智能视听研究院 Image generation method based on attention mechanism and flow model
CN113936073A (en) * 2021-11-02 2022-01-14 哈尔滨理工大学 AtISTANet compressed sensing magnetic resonance reconstruction method based on attention mechanism
CN114399829A (en) * 2022-03-25 2022-04-26 浙江壹体科技有限公司 Posture migration method based on generative countermeasure network, electronic device and medium
CN114401446A (en) * 2021-12-16 2022-04-26 广州方硅信息技术有限公司 Human body posture migration method, device, system, electronic equipment and storage medium
CN114783039A (en) * 2022-06-22 2022-07-22 南京信息工程大学 Motion migration method driven by 3D human body model
CN114863005A (en) * 2022-04-19 2022-08-05 佛山虎牙虎信科技有限公司 Rendering method and device for limb special effect, storage medium and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564119A (en) * 2018-04-04 2018-09-21 华中科技大学 A kind of any attitude pedestrian Picture Generation Method
WO2019015466A1 (en) * 2017-07-17 2019-01-24 广州广电运通金融电子股份有限公司 Method and apparatus for verifying person and certificate

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019015466A1 (en) * 2017-07-17 2019-01-24 广州广电运通金融电子股份有限公司 Method and apparatus for verifying person and certificate
CN108564119A (en) * 2018-04-04 2018-09-21 华中科技大学 A kind of any attitude pedestrian Picture Generation Method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JINSONG ZHANG et al.: "Attention-guided GANs for human pose transfer" *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696027A (en) * 2020-05-20 2020-09-22 电子科技大学 Multi-modal image style migration method based on adaptive attention mechanism
CN111696027B (en) * 2020-05-20 2023-04-07 电子科技大学 Multi-modal image style migration method based on adaptive attention mechanism
CN111681182A (en) * 2020-06-04 2020-09-18 Oppo广东移动通信有限公司 Picture restoration method and device, terminal equipment and storage medium
CN111739115A (en) * 2020-06-23 2020-10-02 中国科学院自动化研究所 Unsupervised human body posture migration method, system and device based on cycle consistency
CN111739115B (en) * 2020-06-23 2021-03-16 中国科学院自动化研究所 Unsupervised human body posture migration method, system and device based on cycle consistency
CN112116673B (en) * 2020-07-29 2022-12-09 西安交通大学 Virtual human body image generation method and system based on structural similarity under posture guidance and electronic equipment
CN112116673A (en) * 2020-07-29 2020-12-22 西安交通大学 Virtual human body image generation method and system based on structural similarity under posture guidance and electronic equipment
CN112149645A (en) * 2020-11-10 2020-12-29 西北工业大学 Human body posture key point identification method based on generation of confrontation learning and graph neural network
CN113538608A (en) * 2021-01-25 2021-10-22 哈尔滨工业大学(深圳) Controllable character image generation method based on generation countermeasure network
CN113538608B (en) * 2021-01-25 2023-08-01 哈尔滨工业大学(深圳) Controllable figure image generation method based on generation countermeasure network
CN113408351A (en) * 2021-05-18 2021-09-17 河南大学 Pedestrian re-recognition method for generating confrontation network based on attitude guidance
CN113706650A (en) * 2021-08-27 2021-11-26 深圳龙岗智能视听研究院 Image generation method based on attention mechanism and flow model
CN113936073A (en) * 2021-11-02 2022-01-14 哈尔滨理工大学 AtISTANet compressed sensing magnetic resonance reconstruction method based on attention mechanism
CN113936073B (en) * 2021-11-02 2024-05-14 哈尔滨理工大学 ATTISTANET compressed sensing magnetic resonance reconstruction method based on attention mechanism
CN114401446A (en) * 2021-12-16 2022-04-26 广州方硅信息技术有限公司 Human body posture migration method, device, system, electronic equipment and storage medium
CN114399829B (en) * 2022-03-25 2022-07-05 浙江壹体科技有限公司 Posture migration method based on generative countermeasure network, electronic device and medium
CN114399829A (en) * 2022-03-25 2022-04-26 浙江壹体科技有限公司 Posture migration method based on generative countermeasure network, electronic device and medium
CN114863005A (en) * 2022-04-19 2022-08-05 佛山虎牙虎信科技有限公司 Rendering method and device for limb special effect, storage medium and equipment
CN114783039A (en) * 2022-06-22 2022-07-22 南京信息工程大学 Motion migration method driven by 3D human body model

Similar Documents

Publication Publication Date Title
CN111161200A (en) Human body posture migration method based on attention mechanism
CN110706302B (en) System and method for synthesizing images by text
CN111275518B (en) Video virtual fitting method and device based on mixed optical flow
CN110399850B (en) Continuous sign language recognition method based on deep neural network
CN110263912A (en) A kind of image answering method based on multiple target association depth reasoning
CN109670576B (en) Multi-scale visual attention image description method
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN113780059B (en) Continuous sign language identification method based on multiple feature points
CN115237255B (en) Natural image co-pointing target positioning system and method based on eye movement and voice
CN112818764A (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN111401156B (en) Image identification method based on Gabor convolution neural network
CN111724458A (en) Voice-driven three-dimensional human face animation generation method and network structure
CN113140020A (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN118038139A (en) Multi-mode small sample image classification method based on large model fine tuning
CN115908639A (en) Transformer-based scene image character modification method and device, electronic equipment and storage medium
CN113888399B (en) Face age synthesis method based on style fusion and domain selection structure
CN113076918B (en) Video-based facial expression cloning method
CN111767842B (en) Micro-expression type discrimination method based on transfer learning and self-encoder data enhancement
CN114944002B (en) Text description-assisted gesture-aware facial expression recognition method
CN114783039B (en) Motion migration method driven by 3D human body model
CN113780350B (en) ViLBERT and BiLSTM-based image description method
CN116311472A (en) Micro-expression recognition method and device based on multi-level graph convolution network
Alves et al. Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication (application publication date: 20200515)