CN109190537B - Mask perception depth reinforcement learning-based multi-person attitude estimation method - Google Patents
Mask perception depth reinforcement learning-based multi-person attitude estimation method
- Publication number
- CN109190537B (application CN201810968949.9A)
- Authority
- CN
- China
- Prior art keywords
- network
- reinforcement learning
- mask
- detection
- deep reinforcement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a multi-person pose estimation method based on mask-aware deep reinforcement learning. A multi-person pose estimation model is first constructed from three sub-networks: a detection network that obtains detection boxes and masks, a deep reinforcement learning network that improves localization accuracy, and a single-person pose estimation network. The model is then trained with training samples. At test time, the image to be processed is fed into the trained multi-person pose estimation model to obtain the pose of the person in every detection box of the image. The method introduces mask information into both the deep reinforcement learning network and the single-person pose estimation network, improving the effectiveness of both stages, and addresses the problems of gradient vanishing and gradient explosion by introducing a residual structure. Compared with other state-of-the-art multi-person pose estimation methods, the disclosed method is more competitive.
Description
Technical Field
The invention relates to human body pose estimation technology, and in particular to a multi-person pose estimation method based on mask-aware deep reinforcement learning.
Background
With the deployment of large numbers of multimedia sensors and the wide application of motion capture technology in fields such as fashion design, clinical analysis, human-computer interaction, behavior recognition, and motion rehabilitation, human pose estimation has become a hot topic in the multimedia community.
Recently, single-person pose estimation has made significant progress by adopting deep-learning-based architectures. However, multi-person pose estimation, i.e. determining the poses of multiple people in an image, and especially estimating individuals in a crowd, is still a difficult task. The main difficulties are as follows. First, the number of people in an image is unknown, and people may appear anywhere in the image and at any scale. Second, there are various interactions between people in the image, such as occlusion, communication, and contact, which make the estimation more difficult. Third, as the number of people in an image increases, the computational complexity increases, which makes designing efficient algorithms a challenge. The main difficulties are illustrated in FIG. 1(a)-(d).
Top-down and bottom-up are the two main approaches to human pose estimation. The top-down approach first detects each person with a single-person detector and then evaluates each detection with a single-person pose estimator. However, when the distance between persons is too small, the single-person detector fails, and the computational complexity grows with the number of persons in the picture. The bottom-up approach works in the opposite direction: joint points are detected first, and the pose of each person is then assembled using local context information; this approach cannot guarantee efficiency because the final assembly requires global information.
Because the detection results of both the top-down and bottom-up methods are not accurately localized, the accuracy of multi-person pose estimation is further reduced. FIG. 1(e) shows the relationship between the detection result and the human pose estimation result. In object detection, a detection box produced by a deep-learning-based detector is accepted when its intersection-over-union with the ground-truth box is larger than 0.5. However, such loose detection results are unfavorable for human pose estimation, so the detection box needs to be corrected on the basis of the original detection result. Deep reinforcement learning is an effective way to select the best action for reaching the optimal value according to the environment information.
Disclosure of Invention
The invention aims to provide a multi-person pose estimation method based on mask-aware deep reinforcement learning that addresses the defects of the prior art.
The purpose of the invention is realized by the following technical scheme: a multi-person posture estimation method based on mask perception deep reinforcement learning comprises the following steps:
(1) constructing a multi-person attitude estimation model: the multi-character posture estimation model consists of a detection network for acquiring a detection frame and a mask, a deep reinforcement learning network for improving positioning accuracy and a single posture estimation network;
(1.1) Detection network: a detection box of the original image and the human body binary mask inside the detection box are obtained through a multitask learning network;
(1.2) Deep reinforcement learning network: this network calibrates the localization result; unlike conventional sampling-based calibration, the calibration is expressed as a Markov decision process, and the detection box is updated through a recursive reward-or-penalty learning process; the goal of this part is to learn an optimal policy function that maps a state S to a behavior A;
In the field of computer vision, most deep reinforcement learning methods take the feature map as the state vector. However, a cluttered background can produce high activation values in the feature map, which interferes with the calibration result and thus affects the human pose estimation process. In the present invention, the environment state is defined as a tuple (h, i), where h is the historical decision vector from the decision network and i is the masked feature map. An original feature map is extracted from the image x with a pre-trained convolutional neural network model f1, and the feature map is passed into the multitask network f2, whose output serves as an attention map for extracting the masked feature map. The expression for i is as follows:
i = f2(f1(x)) ⊙ f1(x)
where ⊙ denotes the Hadamard product.
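As an illustration of the masked-state computation, the following minimal sketch (in Python/PyTorch, an assumed implementation choice since the patent specifies none) applies the Hadamard product of the foreground mask with the backbone feature map; the mask-resizing step and the tensor shapes are assumptions.

```python
# Sketch of the masked feature-map computation i = f2(f1(x)) ⊙ f1(x).
# feat plays the role of f1(x); mask plays the role of the binary output of f2.
import torch
import torch.nn.functional as F

def masked_feature_map(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) backbone feature map; mask: (B, 1, h, w) binary foreground mask."""
    # Resize the mask to the feature-map resolution before the Hadamard product.
    mask = F.interpolate(mask.float(), size=feat.shape[-2:], mode="nearest")
    return feat * mask  # element-wise (Hadamard) product, broadcast over channels
```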
When an accurate foreground mask is used, redundant information in the feature map is removed. The masked feature map provides low-level information such as shape and contour as well as high-level information such as the pose of the human body, which facilitates the calibration process.
The human body binary mask obtained by the detection network is multiplied with the detection-box image, which is resized to match the fully connected layer of the deep reinforcement learning network, and the product is taken as the input of the deep reinforcement learning network. The output of the deep reinforcement learning network is the reward value of each of 11 detection-box adjustment behaviors.
The detection-box adjustment behaviors fall into four types: zooming behaviors (zooming out and zooming in), panning behaviors (panning up, down, left, and right), a termination behavior (deciding whether to terminate the box adjustment), and aspect-ratio adjustment behaviors (increase and decrease in the width direction, and increase and decrease in the height direction). For the detector to produce relatively stable results, each action may be set to move the window by 0.1 times the current window size.
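A minimal sketch of this 11-action set follows; the box representation (x1, y1, x2, y2), the direction conventions, and the action ordering are assumptions, while the 0.1-times-window-size step follows the text above.

```python
# The 11 detection-box adjustment actions: 2 zooming + 4 panning + 4 aspect-ratio + 1 terminate.
STEP = 0.1  # each action moves/changes the window by 0.1 times the current window size

ACTIONS = [
    "scale_up", "scale_down",                           # zooming (2)
    "move_up", "move_down", "move_left", "move_right",  # panning (4)
    "wider", "narrower", "taller", "shorter",           # aspect-ratio adjustment (4)
    "terminate",                                        # termination (1)
]

def apply_action(box, action):
    """box: (x1, y1, x2, y2); returns the adjusted box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dw, dh = STEP * w, STEP * h
    if action == "scale_up":   return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)
    if action == "scale_down": return (x1 + dw, y1 + dh, x2 - dw, y2 - dh)
    if action == "move_up":    return (x1, y1 - dh, x2, y2 - dh)
    if action == "move_down":  return (x1, y1 + dh, x2, y2 + dh)
    if action == "move_left":  return (x1 - dw, y1, x2 - dw, y2)
    if action == "move_right": return (x1 + dw, y1, x2 + dw, y2)
    if action == "wider":      return (x1 - dw, y1, x2 + dw, y2)
    if action == "narrower":   return (x1 + dw, y1, x2 - dw, y2)
    if action == "taller":     return (x1, y1 - dh, x2, y2 + dh)
    if action == "shorter":    return (x1, y1 + dh, x2, y2 - dh)
    return box                 # "terminate" leaves the box unchanged
```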
The behavior with the maximum reward value is selected to adjust the detection box, and the newly obtained detection-box image is iteratively fed into the deep reinforcement learning network until the behavior with the maximum reward value is the termination behavior, at which point the calibrated detection box is output.
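The iterative calibration loop can then be sketched as follows, reusing ACTIONS and apply_action from the sketch above; q_network, crop_and_mask, and max_steps are assumed helpers and parameters, not components named by the patent.

```python
# Hedged sketch of the calibration loop: crop the current box, apply the mask,
# query the Q-network for the values of the 11 actions, and apply the best one
# until "terminate" wins (or a step budget is reached).
def calibrate_box(image, mask, box, q_network, crop_and_mask, max_steps=20):
    for _ in range(max_steps):
        state = crop_and_mask(image, mask, box)   # masked, resized detection-box crop
        q_values = q_network(state)               # one reward/Q value per action
        best = ACTIONS[int(q_values.argmax())]
        if best == "terminate":                   # stop when termination has the largest value
            break
        box = apply_action(box, best)
    return box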
(1.3) the single person posture estimation network is specifically as follows: transmitting the mask and the calibrated detection frame image into a single posture estimation network to obtain a single posture;
(2) training a multi-person attitude estimation model by using a training sample; and inputting the image to be detected into the trained multi-character posture estimation model to obtain the character postures in all detection frames of the image to be detected.
Further, the detection network adopts a two-stage processing mode: in the first stage, a feature map of the original image is extracted with a deep residual network and a plurality of candidate boxes are generated by a region proposal network (RPN); in the second stage, the candidate boxes are passed into three branches for multi-task learning, which respectively produce the classification confidence, the detection-box offsets, and the human body binary mask inside the detection box.
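The two-stage structure described above matches the general design of Mask R-CNN. As a hedged stand-in (not the patent's own network or weights), torchvision's Mask R-CNN produces the same three per-box outputs, namely class confidence, box, and binary mask; the snippet assumes torchvision 0.13 or later and reuses the 0.7 confidence threshold given in the training details.

```python
# Obtaining person detection boxes and binary masks with a stock Mask R-CNN.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

detector = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)                    # placeholder RGB image in [0, 1]
with torch.no_grad():
    out = detector([image])[0]                     # dict with boxes, labels, scores, masks
person = (out["labels"] == 1) & (out["scores"] > 0.7)   # COCO label 1 = person
boxes, masks = out["boxes"][person], out["masks"][person]
binary_masks = (masks > 0.5).squeeze(1)            # per-detection human binary masks
```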
Further, each branch of the detection network in the second stage adopts the following joint loss function:
L = L_cls + α1·L_box + α2·L_mask
where L_cls is the classification loss, expressed with cross entropy; L_box is the localization loss, which measures the difference between the detection box and the ground-truth box with the L1 norm; L_mask is the segmentation loss, expressed with mean binary cross-entropy; and α1 and α2 are scale factors that balance the three losses.
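A minimal sketch of this joint loss is shown below, using the preferred weights α1 = 4.0 and α2 = 10.0 given later in the description; the tensor shapes and the use of PyTorch loss functions are assumptions.

```python
# Joint detection loss L = L_cls + α1·L_box + α2·L_mask.
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, cls_target, box_pred, box_target,
               mask_logits, mask_target, alpha1=4.0, alpha2=10.0):
    l_cls = F.cross_entropy(cls_logits, cls_target)            # classification: cross entropy
    l_box = F.l1_loss(box_pred, box_target)                    # localization: L1 norm
    l_mask = F.binary_cross_entropy_with_logits(mask_logits,   # segmentation: mean binary CE
                                                mask_target)
    return l_cls + alpha1 * l_box + alpha2 * l_mask
```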
Further, the deep reinforcement learning network comprises an 8 × 8 convolutional layer, a 4 × 4 convolutional layer, and a 3 × 3 convolutional layer connected in sequence. The output of the 3 × 3 convolutional layer feeds two branches: one branch obtains an 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, and the other branch obtains a state value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional-layer parameters, α and β are the fully-connected-layer parameters of the two branches, a is a detection-box adjustment behavior, and s is the input of the deep reinforcement learning network. The advantage function and the state value function are added to obtain the Q function, and the reward value of each behavior is computed through the Q function: Q(s, a; θ, α, β) = V(s; θ, β) + A(a, s; θ, α).
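The following sketch follows this dueling architecture; the kernel sizes (8 × 8, 4 × 4, 3 × 3) and the 11-way and 1-way branches come from the text, while the channel counts, strides, 84 × 84 input size, and the omission of the historical decision vector h are assumptions. Note that the sketch sums V and A directly as stated here, whereas many dueling-DQN implementations also subtract the mean advantage.

```python
# Dueling Q-network: shared convolutional trunk (θ), advantage branch A(a,s;θ,α),
# state-value branch V(s;θ,β), combined as Q = V + A.
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, in_channels=3, n_actions=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.advantage = nn.Linear(64 * 7 * 7, n_actions)  # A(a, s; θ, α), 11-dimensional
        self.value = nn.Linear(64 * 7 * 7, 1)               # V(s; θ, β), 1-dimensional

    def forward(self, s):                                    # s: (B, 3, 84, 84) masked box crop
        h = self.features(s)
        return self.value(h) + self.advantage(h)             # Q = V + A, as in the text

q_net = DuelingQNet()
q = q_net(torch.rand(2, 3, 84, 84))                          # (2, 11) action values
```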
Further, in the deep reinforcement learning network, the reward value r of the current iteration is expressed as follows:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ·b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term and the second term λ·b′/B′ is a regularization term added to constrain the detection-box size; w_i and w′_i denote the detection boxes of object i before and after action a, respectively, g_i denotes the ground-truth box, b′ denotes the area of the intersection region between the detection box after action a and the ground-truth box, B′ denotes the area of the detection box after action a, IoU denotes the intersection-over-union, and λ is a scale factor that balances the reward term and the regularization term (the specific value of λ is determined when tuning experimental parameters and is generally 1–10).
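A direct sketch of this reward, with boxes given as (x1, y1, x2, y2) corner coordinates (an assumed representation):

```python
# Reward r(s, a) = (IoU(w', g) - IoU(w, g)) + λ·b'/B'.
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2)) if x2 > x1 and y2 > y1 else 0.0

def iou(a, b):
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter + 1e-8)

def reward(box_before, box_after, gt_box, lam=1.0):
    delta_iou = iou(box_after, gt_box) - iou(box_before, gt_box)                 # reward term
    regular = lam * intersection(box_after, gt_box) / (area(box_after) + 1e-8)   # λ·b'/B'
    return delta_iou + regular
```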
The termination behavior is an additional behavior that does not move the detection box but only determines whether the optimal result of the reinforcement learning has been found. Its reward value is defined as +η when the intersection-over-union between the detection box and the ground-truth box reaches the threshold τ, and −η otherwise, where τ is the IoU threshold that determines the sign of the reward and η is the corresponding reward magnitude.
The action a is selected according to the Q function, which expresses the accumulation of the current and future reward values:
a = argmax_a Q(s, a)
Q(s, a) = r + γ·max_a′ Q(s′, a′)
The loss function loss(θ) for training the Q function is expressed as follows:
loss(θ) = E[r + γ·max_a′ Q(s′, a′, θ) − Q(s, a, θ)]
where θ denotes the parameters of the deep reinforcement learning network, s and a are respectively the input and the detection-box adjustment behavior of the current iteration, s′ and a′ are respectively the input and the detection-box adjustment behavior of the next iteration, Q(s, a, θ) is the sum of all reward values starting from the current iteration, Q(s′, a′, θ) is the sum of all reward values starting from the next iteration, r is the reward value of the current iteration, γ is the discount factor, and E denotes the expectation of the loss value over all iterations.
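A minimal sketch of this objective is given below, reusing the dueling network from the earlier sketch; the squared-error form and the use of a separate target network (introduced in the list that follows) are assumptions consistent with standard deep Q-learning rather than details fixed by the text.

```python
# Temporal-difference loss: compare Q(s, a; θ) with the target r + γ·max_a' Q(s', a').
import torch

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.9):
    """a: (B,) long action indices; r, done: (B,) floats (done is 0.0/1.0)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; θ)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values           # max_a' Q(s', a')
        target = r + gamma * (1.0 - done) * q_next
    return torch.mean((target - q_sa) ** 2)                     # expected (target - Q)^2
```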
Further, in the deep reinforcement learning network, the following measures are adopted to improve the learning efficiency of the parameters θ (a minimal sketch of these measures is given after the list):
(a) to improve learning stability, a target network is introduced and kept separate from the online network; the online network is updated at every iteration, while the target network is updated at intervals;
(b) to avoid getting trapped in local minima, an ε-greedy strategy is adopted as the action policy;
(c) to address the data correlation problem, experience replay is used: transitions (s, a, r, s′) are stored in a buffer, and a fixed number of samples are randomly drawn from the buffer during training to reduce the correlation between data.
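A minimal sketch of these three measures, reusing DuelingQNet, q_net, and dqn_loss from the earlier sketches; the buffer size 10,000, batch size 32, and γ = 0.9 follow the preferred values given later, while the optimizer choice and synchronization interval are assumptions.

```python
# (a) target network, (b) ε-greedy action selection, (c) experience replay.
import random
from collections import deque
import torch

buffer = deque(maxlen=10_000)                       # replay buffer of (s, a, r, s', done)
target_net = DuelingQNet()
target_net.load_state_dict(q_net.state_dict())      # target network starts as a copy
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3, momentum=0.9)

def select_action(state, epsilon):
    if random.random() < epsilon:                   # ε-greedy exploration
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())

def train_step(step, batch_size=32, sync_every=1_000, gamma=0.9):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)       # random sampling breaks correlations
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    loss = dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % sync_every == 0:                      # update the target network at intervals
        target_net.load_state_dict(q_net.state_dict())
```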
Further, the deep reinforcement learning network employs a dueling DQN structure, which can quickly identify the correct behavior during decision evaluation. Like the standard deep Q-network, the dueling structure is trained with back-propagation alone and needs no extra supervision or algorithmic modification to estimate V(s; θ, β) and A(a, s; θ, α) automatically.
Further, the single-person posture estimation network combines a human body binary mask obtained by the detection network with a Cascaded Pyramid Network (CPN) to perform human body posture estimation, and a loss function of the single-person posture estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human body binary mask, and k is a scale factor that balances the two terms (k is set according to practical experience and is generally 1–5); L_mask = Σ_p L_p, where p indexes the human joint points and
L_p = (1 − m_l)·ŷ_p(l)
where ŷ_p(l) is the predicted value of joint p at position l in the activation map, l is the position with the maximum activation value in the activation map, and m_l is the human body binary mask value at position l, which is 1 inside the human body region and 0 in the background region. If the joint is not inside the human body region, the result is penalized; otherwise the loss function is not affected.
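A minimal sketch of this mask regularization term follows; the tensor shapes and the peak-based formulation L_p = (1 − m_l)·ŷ_p(l) follow the description above, but the exact form is a reconstruction rather than the patent's verbatim formula.

```python
# Mask regularization L_mask = Σ_p L_p: each joint is penalized by its peak
# activation when the peak location falls outside the human-body mask.
import torch

def mask_regularizer(heatmaps: torch.Tensor, body_mask: torch.Tensor) -> torch.Tensor:
    """heatmaps: (P, H, W) predicted joint activation maps; body_mask: (H, W) in {0, 1}."""
    P = heatmaps.shape[0]
    flat = heatmaps.view(P, -1)
    peak_vals, peak_idx = flat.max(dim=1)            # value and position l of each joint's peak
    m_l = body_mask.view(-1)[peak_idx].float()       # mask value at each peak position
    return ((1.0 - m_l) * peak_vals).sum()           # penalize joints outside the body region
```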
Further, the multi-character posture estimation model training stage adopts a GPU for calculation.
Preferably, for training the detection network, α1 and α2 in the loss function are set to 4.0 and 10.0, respectively. The whole network is trained with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. The learning rate is 0.01 for the first 60,000 iterations and 0.001 for the last 20,000 iterations. In each batch, 48 positive samples are taken from 4 training images and 48 negative samples from the cluttered background. In the validation phase, the confidence threshold is set to 0.7 and the intersection-over-union threshold for localization is set to 0.6.
In the calibration process based on deep reinforcement learning, the experience-replay buffer holds 10,000 transitions and the batch size is 32. λ in the reward function takes a value of 1–10. An ε-greedy strategy is used in the experimental phase; in training, ε is decreased from 0.3 to 0.05 after 5,000 training steps. The discount factor γ is 0.9.
In the single-person pose estimation stage, k in the loss function is set to 0.4. The model is trained with stochastic gradient descent with an initial learning rate of 0.0005, which is halved after every 10 passes over the dataset. The weight decay is 0.00005, and batch normalization is used.
Compared with the prior art, the invention has the beneficial effects that:
1. the multi-person posture estimation method based on mask perception deep reinforcement learning provided by the invention increases the detection accuracy.
2. The masking information is used to eliminate the negative effects of cluttered background information and to select the best behavior according to the reward function.
3. In the human body posture estimation stage, a regularization term is added to punish nodes outside the human body contour.
4. The multi-person pose estimation model is tested on the MPII test set, where the mean average precision (mAP) is improved by 1.1 over prior-art models, and it reaches a mean average precision of 73.0 on the MS-COCO test-dev dataset.
Drawings
FIG. 1 is a diagram illustrating the difficulty in estimating the pose of a plurality of human beings according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-character pose estimation framework based on mask perceptual depth reinforcement learning according to an embodiment of the present invention;
FIG. 3 is an activation diagram of a detection block in a detection phase provided by an embodiment of the invention;
FIG. 4 is a schematic representation of different behaviors;
FIG. 5 is a schematic diagram of a deep Q network provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a mask-aware pose estimation framework provided by an embodiment of the present invention;
FIG. 7 is a graph of the accuracy of the state versus reward function provided by an embodiment of the present invention;
FIG. 8 is a test result in an MPII dataset provided by an embodiment of the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
The multi-person pose estimation method provided by the embodiment can obtain the position and pose information of a non-fixed number of people in one image, and can be applied to multimedia applications such as clinical analysis, human-computer interaction, and behavior recognition.
The method obtains the detection box and the mask with the multi-task learning network, calibrates the localization with the deep reinforcement learning network, and finally performs human body pose estimation on the person in each detection box with the single-person pose estimation network. Embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is a schematic diagram of the difficulties of multi-person pose estimation provided by an embodiment of the present invention, where (a) indicates that the number and positions of persons in a picture are unknown; (b), (c), and (d) represent occlusion, communication, and contact, respectively, illustrating the interactions between people; and (e) shows the relationship between detection-box detection and human body pose estimation.
Fig. 2 is a schematic diagram of the multi-person pose estimation framework based on mask-aware deep reinforcement learning according to an embodiment of the present invention: a multitask network synchronously obtains the detection box and the person mask, and the localization result is calibrated with the deep reinforcement learning network. Finally, the pose of each person is estimated using the hourglass network. Both the calibration and estimation stages make use of the mask information.
Fig. 3 is an activation map in a detection frame in a detection stage provided by an embodiment of the present invention, and an activation map (b) is obtained by passing an original picture (a) through a convolutional neural network, where we can see that cluttered background information also generates a high activation value. Fig. 3(c) shows that when an accurate foreground mask is used, redundant information in the feature map will be removed.
Fig. 4 is a schematic diagram of different behaviors, which respectively shows 4 types of behaviors of a scaling behavior, a translation behavior, a termination behavior, and an aspect ratio adjustment behavior.
FIG. 5 is a schematic diagram of the deep Q network according to an embodiment of the present invention, comprising an 8 × 8 convolutional layer, a 4 × 4 convolutional layer, and a 3 × 3 convolutional layer connected in sequence. The output of the 3 × 3 convolutional layer feeds two branches: one branch obtains an 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, and the other branch obtains a state value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional-layer parameters, α and β are the fully-connected-layer parameters of the two branches, a is a detection-box adjustment behavior, and s is the input of the deep reinforcement learning network. The advantage function and the state value function are added to obtain the Q function, and the reward value of each behavior is computed through the Q function.
Fig. 6 is a schematic diagram of a posture estimation framework combining masks according to an embodiment of the present invention, where in the single posture estimation network, a human body binary mask obtained by a detection network is combined with a Cascaded Pyramid Network (CPN) to perform human body posture estimation, and a loss function of the single posture estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human body binary mask, and k is a scale factor that balances the two terms (k is set according to practical experience and is generally 1–5); L_mask = Σ_p L_p, where p indexes the human joint points and
L_p = (1 − m_l)·ŷ_p(l)
where ŷ_p(l) is the predicted value of joint p at position l in the activation map, l is the position with the maximum activation value in the activation map, and m_l is the human body binary mask value at position l, which is 1 inside the human body region and 0 in the background region. If the joint is not inside the human body region, the result is penalized; otherwise the loss function is not affected.
FIG. 7 shows the accuracy curves of the state and the reward function, where (a) is the training accuracy of the state, (b) is the testing accuracy of the state, (c) is the training accuracy of the reward function, and (d) is the testing accuracy of the reward function.
The experimental results on the MPII data set are shown in FIG. 8, where (a) shows successful predictions and (b) shows failed predictions. From the failure cases we can conclude that (1) although the detection stage is improved, the top-down method is still affected by early commitment, and (2) our method is best suited to situations where the people in the predicted area have little interaction.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.
Claims (10)
1. A multi-person posture estimation method based on mask perception deep reinforcement learning is characterized by comprising the following steps:
(1) constructing a multi-person attitude estimation model: the multi-character posture estimation model consists of a detection network for acquiring a detection frame and a mask, a deep reinforcement learning network for improving positioning accuracy and a single posture estimation network;
the detection network specifically comprises the following steps: obtaining a detection frame of an original image and a human body binary mask in the detection frame through a multitask learning network;
the deep reinforcement learning network specifically comprises the following steps: the human body binary mask obtained by the detection network is multiplied with the detection-box image resized to match the fully connected layer of the deep reinforcement learning network, and the product is taken as the input of the deep reinforcement learning network; the output of the deep reinforcement learning network is the reward value of 11 detection-box adjustment behaviors; the detection-box adjustment behaviors include four types: zooming behaviors, panning behaviors, a termination behavior, and aspect-ratio adjustment behaviors; the behavior with the maximum reward value is selected to adjust the detection box, the newly obtained detection-box image is iteratively fed into the deep reinforcement learning network until the behavior with the maximum reward value is the termination behavior, and the calibrated detection box is output;
the single-person posture estimation network specifically comprises the following steps: transmitting the mask and the calibrated detection frame image into a single posture estimation network to obtain a single posture;
(2) training a multi-person attitude estimation model by using a training sample; and inputting the image to be detected into the trained multi-character posture estimation model to obtain the character postures in all detection frames of the image to be detected.
2. The multi-character pose estimation method based on mask perception deep reinforcement learning of claim 1, wherein the detection network adopts a two-stage processing mode: in the first stage, a feature map of the original image is extracted with a deep residual network and a plurality of candidate boxes are generated by a region proposal network (RPN); and in the second stage, the candidate boxes are passed into three branches for multi-task learning, which respectively produce the classification confidence, the detection-box offsets, and the human body binary mask inside the detection box.
3. The multi-character pose estimation method based on mask perception deep reinforcement learning of claim 2, wherein each branch of the detection network at the second stage adopts the following joint loss function:
L = L_cls + α1·L_box + α2·L_mask
where L_cls is the classification loss, expressed with cross entropy; L_box is the localization loss, which measures the difference between the detection box and the ground-truth box with the L1 norm; L_mask is the segmentation loss, expressed with mean binary cross-entropy; and α1 and α2 are scale factors that balance the three losses.
4. The mask sensing depth reinforcement learning-based multi-person posture estimation method according to claim 1, wherein the deep reinforcement learning network comprises an 8 × 8 convolutional layer, a 4 × 4 convolutional layer, and a 3 × 3 convolutional layer connected in sequence; the output of the 3 × 3 convolutional layer has two branches, one branch obtaining an 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer and the other branch obtaining a state value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional-layer parameters of the deep reinforcement learning network, α and β are the fully-connected-layer parameters of the two branches, a is a detection-box adjustment behavior, and s is the input of the deep reinforcement learning network; and the advantage function and the state value function are added to obtain a Q function, and the reward value of each behavior is calculated through the Q function.
5. The mask aware depth-based reinforcement learning multi-character pose estimation method according to claim 4, wherein in the depth-based reinforcement learning network, the loss function loss (θ) of the Q function is expressed as follows:
loss(θ) = E[r + γ·max_a′ Q(s′, a′, θ) − Q(s, a, θ)]
where θ denotes the shared convolutional-layer parameters of the deep reinforcement learning network, s and a are respectively the input and the detection-box adjustment behavior of the current iteration, s′ and a′ are respectively the input and the detection-box adjustment behavior of the next iteration, Q(s, a, θ) is the sum of all reward values starting from the current iteration, Q(s′, a′, θ) is the sum of all reward values starting from the next iteration, r is the reward value of the current iteration, γ is the discount factor, and E denotes the expectation of the loss value over all iterations.
6. The mask-aware deep reinforcement learning-based multi-character pose estimation method according to claim 5, wherein in the deep reinforcement learning network, the reward value r of the current iteration is expressed as follows:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ·b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term and the second term λ·b′/B′ is a regularization term added to constrain the detection-box size; w_i and w′_i denote the detection boxes of object i before and after action a, respectively, g_i denotes the ground-truth box, b′ denotes the area of the intersection region between the detection box after action a and the ground-truth box, B′ denotes the area of the detection box after action a, IoU denotes the intersection-over-union, and λ is the scale factor that balances the reward term and the regularization term.
7. The method for estimating poses of multiple human beings based on mask perception deep reinforcement learning according to claim 5, characterized in that, in the deep reinforcement learning network, in order to improve the learning efficiency of the parameter θ, the following way is adopted:
(a) in order to improve the learning stability, a target network is introduced, the target network is separated from the online network, the online network is updated in each iteration, and the target network is updated at intervals;
(b) in order to avoid getting trapped in local minima, an ε-greedy strategy is adopted as the action policy;
(c) to address the data correlation problem, experience replay is used: transitions (s, a, r, s′) are stored in a buffer, and a fixed number of samples are randomly selected from the buffer during training to reduce the correlation between data.
8. The mask aware depth-based reinforcement learning multi-character pose estimation method as claimed in claim 1, wherein the deep reinforcement learning network employs a dueling DQN structure that can quickly identify the correct behavior during decision evaluation.
9. The mask-aware deep reinforcement learning-based multi-character posture estimation method as claimed in claim 1, wherein the single posture estimation network combines a human body binary mask obtained by the detection network with the cascaded pyramid network to perform human body posture estimation, and a loss function of the single posture estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human body binary mask, and k is a scale factor that balances the two terms; L_mask = Σ_p L_p, where p indexes the human joint points;
L_p = (1 − m_l)·ŷ_p(l)
where ŷ_p(l) is the predicted value of joint p at position l in the activation map, l is the position with the maximum activation value in the activation map, and m_l is the human body binary mask value at position l, which is 1 inside the human body region and 0 in the background region; if the joint is not inside the human body region, the result is penalized, otherwise the loss function is not affected.
10. The multi-character pose estimation method based on mask perception deep reinforcement learning of claim 1, wherein the multi-character pose estimation model training phase adopts a GPU for calculation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810968949.9A CN109190537B (en) | 2018-08-23 | 2018-08-23 | Mask perception depth reinforcement learning-based multi-person attitude estimation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810968949.9A CN109190537B (en) | 2018-08-23 | 2018-08-23 | Mask perception depth reinforcement learning-based multi-person attitude estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109190537A CN109190537A (en) | 2019-01-11 |
CN109190537B true CN109190537B (en) | 2020-09-29 |
Family
ID=64919381
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810968949.9A Active CN109190537B (en) | 2018-08-23 | 2018-08-23 | Mask perception depth reinforcement learning-based multi-person attitude estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109190537B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766887B (en) * | 2019-01-16 | 2022-11-11 | 中国科学院光电技术研究所 | Multi-target detection method based on cascaded hourglass neural network |
CN109784296A (en) * | 2019-01-27 | 2019-05-21 | 武汉星巡智能科技有限公司 | Bus occupant quantity statistics method, device and computer readable storage medium |
CN110008915B (en) * | 2019-04-11 | 2023-02-03 | 电子科技大学 | System and method for estimating dense human body posture based on mask-RCNN |
CN110188219B (en) * | 2019-05-16 | 2023-01-06 | 复旦大学 | Depth-enhanced redundancy-removing hash method for image retrieval |
CN110222636B (en) * | 2019-05-31 | 2023-04-07 | 中国民航大学 | Pedestrian attribute identification method based on background suppression |
CN110210402B (en) * | 2019-06-03 | 2021-11-19 | 北京卡路里信息技术有限公司 | Feature extraction method and device, terminal equipment and storage medium |
CN110197163B (en) * | 2019-06-04 | 2021-02-12 | 中国矿业大学 | Target tracking sample expansion method based on pedestrian search |
CN110415332A (en) * | 2019-06-21 | 2019-11-05 | 上海工程技术大学 | Complex textile surface three dimensional reconstruction system and method under a kind of non-single visual angle |
CN110335291A (en) * | 2019-07-01 | 2019-10-15 | 腾讯科技(深圳)有限公司 | Personage's method for tracing and terminal |
CN112184802B (en) * | 2019-07-05 | 2023-10-20 | 杭州海康威视数字技术股份有限公司 | Calibration frame adjusting method, device and storage medium |
CN112241976B (en) * | 2019-07-19 | 2024-08-27 | 杭州海康威视数字技术股份有限公司 | Model training method and device |
CN110569719B (en) * | 2019-07-30 | 2022-05-17 | 中国科学技术大学 | Animal head posture estimation method and system |
CN110866872B (en) * | 2019-10-10 | 2022-07-29 | 北京邮电大学 | Pavement crack image preprocessing intelligent selection method and device and electronic equipment |
CN111415389B (en) * | 2020-03-18 | 2023-08-29 | 清华大学 | Label-free six-dimensional object posture prediction method and device based on reinforcement learning |
CN111738091A (en) * | 2020-05-27 | 2020-10-02 | 复旦大学 | Posture estimation and human body analysis system based on multi-task deep learning |
CN111695457B (en) * | 2020-05-28 | 2023-05-09 | 浙江工商大学 | Human body posture estimation method based on weak supervision mechanism |
CN112052886B (en) * | 2020-08-21 | 2022-06-03 | 暨南大学 | Intelligent human body action posture estimation method and device based on convolutional neural network |
CN112884780A (en) * | 2021-02-06 | 2021-06-01 | 罗普特科技集团股份有限公司 | Estimation method and system for human body posture |
CN113012229A (en) * | 2021-03-26 | 2021-06-22 | 北京华捷艾米科技有限公司 | Method and device for positioning human body joint points |
CN113361570B (en) * | 2021-05-25 | 2022-11-01 | 东南大学 | 3D human body posture estimation method based on joint data enhancement and network training model |
CN113436633B (en) * | 2021-06-30 | 2024-03-12 | 平安科技(深圳)有限公司 | Speaker recognition method, speaker recognition device, computer equipment and storage medium |
CN113537070B (en) * | 2021-07-19 | 2022-11-22 | 中国第一汽车股份有限公司 | Detection method, detection device, electronic equipment and storage medium |
CN114143710B (en) * | 2021-11-22 | 2022-10-04 | 武汉大学 | Wireless positioning method and system based on reinforcement learning |
CN116721471A (en) * | 2023-08-10 | 2023-09-08 | 中国科学院合肥物质科学研究院 | Multi-person three-dimensional attitude estimation method based on multi-view angles |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11004358B2 (en) * | 2015-04-22 | 2021-05-11 | Jeffrey B. Matthews | Visual and kinesthetic method and educational kit for solving algebraic linear equations involving an unknown variable |
US10607342B2 (en) * | 2016-09-30 | 2020-03-31 | Siemenes Healthcare GmbH | Atlas-based contouring of organs at risk for radiation therapy |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150544A (en) * | 2011-08-30 | 2013-06-12 | 精工爱普生株式会社 | Method and apparatus for object pose estimation |
CN106780569A (en) * | 2016-11-18 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of human body attitude estimates behavior analysis method |
CN106780536A (en) * | 2017-01-13 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of shape based on object mask network perceives example dividing method |
CN106897697A (en) * | 2017-02-24 | 2017-06-27 | 深圳市唯特视科技有限公司 | A kind of personage and pose detection method based on visualization compiler |
CN106951512A (en) * | 2017-03-17 | 2017-07-14 | 深圳市唯特视科技有限公司 | A kind of end-to-end session control method based on hybrid coding network |
CN107392118A (en) * | 2017-07-04 | 2017-11-24 | 竹间智能科技(上海)有限公司 | The recognition methods of reinforcing face character and the system of generation network are resisted based on multitask |
CN107944443A (en) * | 2017-11-16 | 2018-04-20 | 深圳市唯特视科技有限公司 | One kind carries out object consistency detection method based on end-to-end deep learning |
CN108256489A (en) * | 2018-01-24 | 2018-07-06 | 清华大学 | Behavior prediction method and device based on deeply study |
CN108304795A (en) * | 2018-01-29 | 2018-07-20 | 清华大学 | Human skeleton Activity recognition method and device based on deeply study |
Non-Patent Citations (4)
Title |
---|
Canonical Locality Preserving Latent Variable Model for Discriminative Pose Inference;Yan Tian 等;《Image and Vision Computing》;20130331;第223-230页 * |
DNN-BASED SOURCE ENHANCEMENT SELF-OPTIMIZED BY REINFORCEMENT LEARNING USING SOUND QUALITY MEASUREMENTS;Koizumi, Yuma 等;《2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》;20170509;第81-85页 * |
Human Pose Estimation in Images and Videos Based on Part Detectors; Su Yanchao et al.; Journal of Electronics & Information Technology; 2011-06-30; Vol. 33, No. 6; pp. 1413-1419 *
A Survey of Object Tracking Algorithms; Lu Huchuan et al.; Pattern Recognition and Artificial Intelligence; 2018-01-31; Vol. 31, No. 1; pp. 61-76 *
Also Published As
Publication number | Publication date |
---|---|
CN109190537A (en) | 2019-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109190537B (en) | Mask perception depth reinforcement learning-based multi-person attitude estimation method | |
US11468697B2 (en) | Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN107529650B (en) | Closed loop detection method and device and computer equipment | |
US10957309B2 (en) | Neural network method and apparatus | |
CN106570464B (en) | Face recognition method and device for rapidly processing face shielding | |
CN111310672A (en) | Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling | |
CN113570029A (en) | Method for obtaining neural network model, image processing method and device | |
CN113705769A (en) | Neural network training method and device | |
CN108182260B (en) | Multivariate time sequence classification method based on semantic selection | |
CN104504366A (en) | System and method for smiling face recognition based on optical flow features | |
CN110716792B (en) | Target detector and construction method and application thereof | |
CN111598183A (en) | Multi-feature fusion image description method | |
CN113592060A (en) | Neural network optimization method and device | |
CN113407820B (en) | Method for processing data by using model, related system and storage medium | |
CN115187786A (en) | Rotation-based CenterNet2 target detection method | |
CN113781517A (en) | System and method for motion estimation | |
CN113326735A (en) | Multi-mode small target detection method based on YOLOv5 | |
CN114140885A (en) | Emotion analysis model generation method and device, electronic equipment and storage medium | |
CN109345559B (en) | Moving target tracking method based on sample expansion and depth classification network | |
US20050114103A1 (en) | System and method for sequential kernel density approximation through mode propagation | |
CN116486233A (en) | Target detection method for multispectral double-flow network | |
CN112989952B (en) | Crowd density estimation method and device based on mask guidance | |
KR20230141828A (en) | Neural networks using adaptive gradient clipping | |
CN112053386B (en) | Target tracking method based on depth convolution characteristic self-adaptive integration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20210929 Address after: 310000 Room 401, building 2, No.16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province Patentee after: Hangzhou yunqi smart Vision Technology Co., Ltd Address before: 310018, No. 18 Jiao Tong Street, Xiasha Higher Education Park, Hangzhou, Zhejiang Patentee before: ZHEJIANG GONGSHANG University |
|
TR01 | Transfer of patent right |