CN109190537B - Mask perception depth reinforcement learning-based multi-person attitude estimation method - Google Patents
Mask perception depth reinforcement learning-based multi-person attitude estimation method
- Publication number
- CN109190537B (application CN201810968949.9A)
- Authority
- CN
- China
- Prior art keywords
- network
- reinforcement learning
- mask
- detection
- deep reinforcement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a multi-person pose estimation method based on mask-aware deep reinforcement learning. A multi-person pose estimation model is first constructed from three sub-networks: a detection network that obtains detection boxes and masks, a deep reinforcement learning network that improves localization accuracy, and a single-person pose estimation network. The model is then trained with training samples. At test time, the image to be processed is fed into the trained multi-person pose estimation model to obtain the pose of the person in every detection box of the image. The method introduces mask information into both the deep reinforcement learning network and the single-person pose estimation network, improving the effectiveness of both stages, and addresses the problems of gradient vanishing and gradient explosion by introducing a residual structure. Compared with other state-of-the-art multi-person pose estimation methods, the disclosed method is more competitive.
Description
Technical Field
The invention relates to human body pose estimation technology, and in particular to a multi-person pose estimation method based on mask-aware deep reinforcement learning.
Background
With the deployment of large numbers of multimedia sensors and the wide application of motion capture technology in fields such as fashion design, clinical analysis, human-computer interaction, behavior recognition, and motion rehabilitation, human pose estimation has become a hot topic in the multimedia community.
Recently, single-person pose estimation has made significant progress by adopting deep-learning-based architectures. However, multi-person pose estimation, i.e. determining the poses of multiple people in an image, and especially estimating individuals in a crowd, is still a difficult task. The main difficulties are as follows. First, the number of people in an image is unknown, and people may appear anywhere in the image and at any scale. Second, there are various interactions between people in the image, such as occlusion, communication, and contact, which make the estimation more difficult. Third, as the number of people in an image increases, the computational complexity increases, which makes designing efficient algorithms a challenge. The main difficulties are illustrated in FIG. 1(a)-(d).
Top-down and bottom-up are the two main approaches to human pose estimation. The top-down approach first detects each person with a single-person detector and then evaluates each detection with a single-person pose estimator. However, when the distance between persons is too small, the single-person detector fails, and the computational complexity grows with the number of persons in the picture. The bottom-up approach works in the opposite direction: joint points are detected first, and the pose of each person is then assembled using local context information; this approach cannot guarantee efficiency because the final assembly requires global information.
Because the detection results of both the top-down and bottom-up methods are not accurately localized, the accuracy of multi-person pose estimation is further reduced. FIG. 1(e) shows the relationship between the detection result and the human pose estimation result. In object detection, a detection box produced by a deep-learning-based detector is accepted when its intersection-over-union with the ground-truth box is larger than 0.5. However, such loose detection results are unfavorable for human pose estimation, so the detection box needs to be corrected on the basis of the original detection result. Deep reinforcement learning is an effective way to select the best action for reaching the optimal value according to the environment information.
Disclosure of Invention
The invention aims to provide a multi-person pose estimation method based on mask-aware deep reinforcement learning that addresses the defects of the prior art.
The purpose of the invention is realized by the following technical scheme: a multi-person posture estimation method based on mask perception deep reinforcement learning comprises the following steps:
(1) constructing a multi-person attitude estimation model: the multi-character posture estimation model consists of a detection network for acquiring a detection frame and a mask, a deep reinforcement learning network for improving positioning accuracy and a single posture estimation network;
(1.1) Detection network: a detection box of the original image and the human body binary mask inside the detection box are obtained through a multitask learning network;
(1.2) Deep reinforcement learning network: this network calibrates the localization result; unlike conventional sampling-based calibration, the calibration is expressed as a Markov decision process, and the detection box is updated through a recursive reward-or-penalty learning process; the goal of this part is to learn an optimal policy function that maps a state S to a behavior A;
In the field of computer vision, most deep reinforcement learning methods take the feature map as the state vector. However, a cluttered background can produce high activation values in the feature map, which interferes with the calibration result and thus affects the human pose estimation process. In the present invention, the environment state is defined as a tuple (h, i), where h is the historical decision vector from the decision network and i is the masked feature map. An original feature map is extracted from the image x with a pre-trained convolutional neural network model f1, and the feature map is passed into the multitask network f2, whose output serves as an attention map for extracting the masked feature map. The expression for i is as follows:
i = f2(f1(x)) ⊙ f1(x)
where ⊙ denotes the Hadamard product.
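As an illustration of the masked-state computation, the following minimal sketch (in Python/PyTorch, an assumed implementation choice since the patent specifies none) applies the Hadamard product of the foreground mask with the backbone feature map; the mask-resizing step and the tensor shapes are assumptions.

```python
# Sketch of the masked feature-map computation i = f2(f1(x)) ⊙ f1(x).
# feat plays the role of f1(x); mask plays the role of the binary output of f2.
import torch
import torch.nn.functional as F

def masked_feature_map(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) backbone feature map; mask: (B, 1, h, w) binary foreground mask."""
    # Resize the mask to the feature-map resolution before the Hadamard product.
    mask = F.interpolate(mask.float(), size=feat.shape[-2:], mode="nearest")
    return feat * mask  # element-wise (Hadamard) product, broadcast over channels
```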
When an accurate foreground mask is used, redundant information in the feature map is removed. The masked feature map provides low-level information such as shape and contour as well as high-level information such as the pose of the human body, which facilitates the calibration process.
The human body binary mask obtained by the detection network is multiplied with the detection-box image, which is resized to match the fully connected layer of the deep reinforcement learning network, and the product is taken as the input of the deep reinforcement learning network. The output of the deep reinforcement learning network is the reward value of each of 11 detection-box adjustment behaviors.
The detection-box adjustment behaviors fall into four types: zooming behaviors (zooming out and zooming in), panning behaviors (panning up, down, left, and right), a termination behavior (deciding whether to terminate the box adjustment), and aspect-ratio adjustment behaviors (increase and decrease in the width direction, and increase and decrease in the height direction). For the detector to produce relatively stable results, each action may be set to move the window by 0.1 times the current window size.
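A minimal sketch of this 11-action set follows; the box representation (x1, y1, x2, y2), the direction conventions, and the action ordering are assumptions, while the 0.1-times-window-size step follows the text above.

```python
# The 11 detection-box adjustment actions: 2 zooming + 4 panning + 4 aspect-ratio + 1 terminate.
STEP = 0.1  # each action moves/changes the window by 0.1 times the current window size

ACTIONS = [
    "scale_up", "scale_down",                           # zooming (2)
    "move_up", "move_down", "move_left", "move_right",  # panning (4)
    "wider", "narrower", "taller", "shorter",           # aspect-ratio adjustment (4)
    "terminate",                                        # termination (1)
]

def apply_action(box, action):
    """box: (x1, y1, x2, y2); returns the adjusted box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dw, dh = STEP * w, STEP * h
    if action == "scale_up":   return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)
    if action == "scale_down": return (x1 + dw, y1 + dh, x2 - dw, y2 - dh)
    if action == "move_up":    return (x1, y1 - dh, x2, y2 - dh)
    if action == "move_down":  return (x1, y1 + dh, x2, y2 + dh)
    if action == "move_left":  return (x1 - dw, y1, x2 - dw, y2)
    if action == "move_right": return (x1 + dw, y1, x2 + dw, y2)
    if action == "wider":      return (x1 - dw, y1, x2 + dw, y2)
    if action == "narrower":   return (x1 + dw, y1, x2 - dw, y2)
    if action == "taller":     return (x1, y1 - dh, x2, y2 + dh)
    if action == "shorter":    return (x1, y1 + dh, x2, y2 - dh)
    return box                 # "terminate" leaves the box unchanged
```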
The behavior with the maximum reward value is selected to adjust the detection box, and the newly obtained detection-box image is iteratively fed into the deep reinforcement learning network until the behavior with the maximum reward value is the termination behavior, at which point the calibrated detection box is output.
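The iterative calibration loop can then be sketched as follows, reusing ACTIONS and apply_action from the sketch above; q_network, crop_and_mask, and max_steps are assumed helpers and parameters, not components named by the patent.

```python
# Hedged sketch of the calibration loop: crop the current box, apply the mask,
# query the Q-network for the values of the 11 actions, and apply the best one
# until "terminate" wins (or a step budget is reached).
def calibrate_box(image, mask, box, q_network, crop_and_mask, max_steps=20):
    for _ in range(max_steps):
        state = crop_and_mask(image, mask, box)   # masked, resized detection-box crop
        q_values = q_network(state)               # one reward/Q value per action
        best = ACTIONS[int(q_values.argmax())]
        if best == "terminate":                   # stop when termination has the largest value
            break
        box = apply_action(box, best)
    return box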
(1.3) the single person posture estimation network is specifically as follows: transmitting the mask and the calibrated detection frame image into a single posture estimation network to obtain a single posture;
(2) training a multi-person attitude estimation model by using a training sample; and inputting the image to be detected into the trained multi-character posture estimation model to obtain the character postures in all detection frames of the image to be detected.
Further, the detection network adopts a two-stage processing mode: in the first stage, a feature map of the original image is extracted with a deep residual network and a plurality of candidate boxes are generated by a region proposal network (RPN); in the second stage, the candidate boxes are passed into three branches for multi-task learning, which respectively produce the classification confidence, the detection-box offsets, and the human body binary mask inside the detection box.
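The two-stage structure described above matches the general design of Mask R-CNN. As a hedged stand-in (not the patent's own network or weights), torchvision's Mask R-CNN produces the same three per-box outputs, namely class confidence, box, and binary mask; the snippet assumes torchvision 0.13 or later and reuses the 0.7 confidence threshold given in the training details.

```python
# Obtaining person detection boxes and binary masks with a stock Mask R-CNN.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

detector = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)                    # placeholder RGB image in [0, 1]
with torch.no_grad():
    out = detector([image])[0]                     # dict with boxes, labels, scores, masks
person = (out["labels"] == 1) & (out["scores"] > 0.7)   # COCO label 1 = person
boxes, masks = out["boxes"][person], out["masks"][person]
binary_masks = (masks > 0.5).squeeze(1)            # per-detection human binary masks
```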
Further, each branch of the detection network in the second stage adopts the following joint loss function:
L = L_cls + α1·L_box + α2·L_mask
where L_cls is the classification loss, expressed with cross entropy; L_box is the localization loss, which measures the difference between the detection box and the ground-truth box with the L1 norm; L_mask is the segmentation loss, expressed with mean binary cross-entropy; and α1 and α2 are scale factors that balance the three losses.
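A minimal sketch of this joint loss is shown below, using the preferred weights α1 = 4.0 and α2 = 10.0 given later in the description; the tensor shapes and the use of PyTorch loss functions are assumptions.

```python
# Joint detection loss L = L_cls + α1·L_box + α2·L_mask.
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, cls_target, box_pred, box_target,
               mask_logits, mask_target, alpha1=4.0, alpha2=10.0):
    l_cls = F.cross_entropy(cls_logits, cls_target)            # classification: cross entropy
    l_box = F.l1_loss(box_pred, box_target)                    # localization: L1 norm
    l_mask = F.binary_cross_entropy_with_logits(mask_logits,   # segmentation: mean binary CE
                                                mask_target)
    return l_cls + alpha1 * l_box + alpha2 * l_mask
```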
Further, the deep reinforcement learning network comprises an 8 × 8 convolutional layer, a 4 × 4 convolutional layer, and a 3 × 3 convolutional layer connected in sequence. The output of the 3 × 3 convolutional layer feeds two branches: one branch obtains an 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, and the other branch obtains a state value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional-layer parameters, α and β are the fully-connected-layer parameters of the two branches, a is a detection-box adjustment behavior, and s is the input of the deep reinforcement learning network. The advantage function and the state value function are added to obtain the Q function, and the reward value of each behavior is computed through the Q function: Q(s, a; θ, α, β) = V(s; θ, β) + A(a, s; θ, α).
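The following sketch follows this dueling architecture; the kernel sizes (8 × 8, 4 × 4, 3 × 3) and the 11-way and 1-way branches come from the text, while the channel counts, strides, 84 × 84 input size, and the omission of the historical decision vector h are assumptions. Note that the sketch sums V and A directly as stated here, whereas many dueling-DQN implementations also subtract the mean advantage.

```python
# Dueling Q-network: shared convolutional trunk (θ), advantage branch A(a,s;θ,α),
# state-value branch V(s;θ,β), combined as Q = V + A.
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, in_channels=3, n_actions=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.advantage = nn.Linear(64 * 7 * 7, n_actions)  # A(a, s; θ, α), 11-dimensional
        self.value = nn.Linear(64 * 7 * 7, 1)               # V(s; θ, β), 1-dimensional

    def forward(self, s):                                    # s: (B, 3, 84, 84) masked box crop
        h = self.features(s)
        return self.value(h) + self.advantage(h)             # Q = V + A, as in the text

q_net = DuelingQNet()
q = q_net(torch.rand(2, 3, 84, 84))                          # (2, 11) action values
```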
Further, in the deep reinforcement learning network, the reward value r of the current iteration is expressed as follows:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ·b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term and the second term λ·b′/B′ is a regularization term added to constrain the detection-box size; w_i and w′_i denote the detection boxes of object i before and after action a, respectively, g_i denotes the ground-truth box, b′ denotes the area of the intersection region between the detection box after action a and the ground-truth box, B′ denotes the area of the detection box after action a, IoU denotes the intersection-over-union, and λ is a scale factor that balances the reward term and the regularization term (the specific value of λ is determined when tuning experimental parameters and is generally 1–10).
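A direct sketch of this reward, with boxes given as (x1, y1, x2, y2) corner coordinates (an assumed representation):

```python
# Reward r(s, a) = (IoU(w', g) - IoU(w, g)) + λ·b'/B'.
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2)) if x2 > x1 and y2 > y1 else 0.0

def iou(a, b):
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter + 1e-8)

def reward(box_before, box_after, gt_box, lam=1.0):
    delta_iou = iou(box_after, gt_box) - iou(box_before, gt_box)                 # reward term
    regular = lam * intersection(box_after, gt_box) / (area(box_after) + 1e-8)   # λ·b'/B'
    return delta_iou + regular
```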
The termination behavior is an additional behavior that does not move the detection box but only determines whether the optimal result of the reinforcement learning has been found. Its reward value is defined as +η when the intersection-over-union between the detection box and the ground-truth box reaches the threshold τ, and −η otherwise, where τ is the IoU threshold that determines the sign of the reward and η is the corresponding reward magnitude.
The action a is selected according to the Q function, which expresses the accumulation of the current and future reward values:
a = argmax_a Q(s, a)
Q(s, a) = r + γ·max_a′ Q(s′, a′)
The loss function loss(θ) for training the Q function is expressed as follows:
loss(θ) = E[r + γ·max_a′ Q(s′, a′, θ) − Q(s, a, θ)]
where θ denotes the parameters of the deep reinforcement learning network, s and a are respectively the input and the detection-box adjustment behavior of the current iteration, s′ and a′ are respectively the input and the detection-box adjustment behavior of the next iteration, Q(s, a, θ) is the sum of all reward values starting from the current iteration, Q(s′, a′, θ) is the sum of all reward values starting from the next iteration, r is the reward value of the current iteration, γ is the discount factor, and E denotes the expectation of the loss value over all iterations.
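A minimal sketch of this objective is given below, reusing the dueling network from the earlier sketch; the squared-error form and the use of a separate target network (introduced in the list that follows) are assumptions consistent with standard deep Q-learning rather than details fixed by the text.

```python
# Temporal-difference loss: compare Q(s, a; θ) with the target r + γ·max_a' Q(s', a').
import torch

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.9):
    """a: (B,) long action indices; r, done: (B,) floats (done is 0.0/1.0)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; θ)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values           # max_a' Q(s', a')
        target = r + gamma * (1.0 - done) * q_next
    return torch.mean((target - q_sa) ** 2)                     # expected (target - Q)^2
```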
Further, in the deep reinforcement learning network, the following measures are adopted to improve the learning efficiency of the parameters θ (a minimal sketch of these measures is given after the list):
(a) to improve learning stability, a target network is introduced and kept separate from the online network; the online network is updated at every iteration, while the target network is updated at intervals;
(b) to avoid getting trapped in local minima, an ε-greedy strategy is adopted as the action policy;
(c) to address the data correlation problem, experience replay is used: transitions (s, a, r, s′) are stored in a buffer, and a fixed number of samples are randomly drawn from the buffer during training to reduce the correlation between data.
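A minimal sketch of these three measures, reusing DuelingQNet, q_net, and dqn_loss from the earlier sketches; the buffer size 10,000, batch size 32, and γ = 0.9 follow the preferred values given later, while the optimizer choice and synchronization interval are assumptions.

```python
# (a) target network, (b) ε-greedy action selection, (c) experience replay.
import random
from collections import deque
import torch

buffer = deque(maxlen=10_000)                       # replay buffer of (s, a, r, s', done)
target_net = DuelingQNet()
target_net.load_state_dict(q_net.state_dict())      # target network starts as a copy
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3, momentum=0.9)

def select_action(state, epsilon):
    if random.random() < epsilon:                   # ε-greedy exploration
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())

def train_step(step, batch_size=32, sync_every=1_000, gamma=0.9):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)       # random sampling breaks correlations
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    loss = dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % sync_every == 0:                      # update the target network at intervals
        target_net.load_state_dict(q_net.state_dict())
```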
Further, the deep reinforcement learning network employs a dueling DQN structure, which can quickly identify the correct behavior during decision evaluation. Like the standard deep Q-network, the dueling structure is trained with back-propagation alone and needs no extra supervision or algorithmic modification to estimate V(s; θ, β) and A(a, s; θ, α) automatically.
Further, the single-person posture estimation network combines a human body binary mask obtained by the detection network with a Cascaded Pyramid Network (CPN) to perform human body posture estimation, and a loss function of the single-person posture estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human body binary mask, and k is a scale factor that balances the two terms (k is set according to practical experience and is generally 1–5); L_mask = Σ_p L_p, where p indexes the human joint points and
L_p = (1 − m_l)·ŷ_p(l)
where ŷ_p(l) is the predicted value of joint p at position l in the activation map, l is the position with the maximum activation value in the activation map, and m_l is the human body binary mask value at position l, which is 1 inside the human body region and 0 in the background region. If the joint is not inside the human body region, the result is penalized; otherwise the loss function is not affected.
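A minimal sketch of this mask regularization term follows; the tensor shapes and the peak-based formulation L_p = (1 − m_l)·ŷ_p(l) follow the description above, but the exact form is a reconstruction rather than the patent's verbatim formula.

```python
# Mask regularization L_mask = Σ_p L_p: each joint is penalized by its peak
# activation when the peak location falls outside the human-body mask.
import torch

def mask_regularizer(heatmaps: torch.Tensor, body_mask: torch.Tensor) -> torch.Tensor:
    """heatmaps: (P, H, W) predicted joint activation maps; body_mask: (H, W) in {0, 1}."""
    P = heatmaps.shape[0]
    flat = heatmaps.view(P, -1)
    peak_vals, peak_idx = flat.max(dim=1)            # value and position l of each joint's peak
    m_l = body_mask.view(-1)[peak_idx].float()       # mask value at each peak position
    return ((1.0 - m_l) * peak_vals).sum()           # penalize joints outside the body region
```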
Further, the multi-character posture estimation model training stage adopts a GPU for calculation.
Preferably, for training the detection network, α1 and α2 in the loss function are set to 4.0 and 10.0, respectively. The whole network is trained with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. The learning rate is 0.01 for the first 60,000 iterations and 0.001 for the last 20,000 iterations. In each batch, 48 positive samples are taken from 4 training images and 48 negative samples from the cluttered background. In the validation phase, the confidence threshold is set to 0.7 and the intersection-over-union threshold for localization is set to 0.6.
In the calibration process based on deep reinforcement learning, the experience-replay buffer holds 10,000 transitions and the batch size is 32. λ in the reward function takes a value of 1–10. An ε-greedy strategy is used in the experimental phase; in training, ε is decreased from 0.3 to 0.05 after 5,000 training steps. The discount factor γ is 0.9.
In the single-person pose estimation stage, k in the loss function is set to 0.4. The model is trained with stochastic gradient descent with an initial learning rate of 0.0005, which is halved after every 10 passes over the dataset. The weight decay is 0.00005, and batch normalization is used.
Compared with the prior art, the invention has the beneficial effects that:
1. the multi-person posture estimation method based on mask perception deep reinforcement learning provided by the invention increases the detection accuracy.
2. The masking information is used to eliminate the negative effects of cluttered background information and to select the best behavior according to the reward function.
3. In the human body posture estimation stage, a regularization term is added to punish nodes outside the human body contour.
4. The multi-person pose estimation model is tested on the MPII test set, where the mean average precision (mAP) is improved by 1.1 over prior-art models, and it reaches a mean average precision of 73.0 on the MS-COCO test-dev dataset.
Drawings
FIG. 1 is a diagram illustrating the difficulty in estimating the pose of a plurality of human beings according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-character pose estimation framework based on mask perceptual depth reinforcement learning according to an embodiment of the present invention;
FIG. 3 is an activation diagram of a detection block in a detection phase provided by an embodiment of the invention;
FIG. 4 is a schematic representation of different behaviors;
FIG. 5 is a schematic diagram of a deep Q network provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a mask-aware pose estimation framework provided by an embodiment of the present invention;
FIG. 7 is a graph of the accuracy of the state versus reward function provided by an embodiment of the present invention;
FIG. 8 is a test result in an MPII dataset provided by an embodiment of the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
The multi-person pose estimation method provided by the embodiment can obtain the position and pose information of a non-fixed number of people in one image, and can be applied to multimedia applications such as clinical analysis, human-computer interaction, and behavior recognition.
The method obtains the detection box and the mask with the multi-task learning network, calibrates the localization with the deep reinforcement learning network, and finally performs human body pose estimation on the person in each detection box with the single-person pose estimation network. Embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is a schematic diagram of the difficulties of multi-person pose estimation provided by an embodiment of the present invention, where (a) indicates that the number and positions of persons in a picture are unknown; (b), (c), and (d) represent occlusion, communication, and contact, respectively, illustrating the interactions between people; and (e) shows the relationship between detection-box detection and human body pose estimation.
Fig. 2 is a schematic diagram of the multi-person pose estimation framework based on mask-aware deep reinforcement learning according to an embodiment of the present invention: a multitask network synchronously obtains the detection box and the person mask, and the localization result is calibrated with the deep reinforcement learning network. Finally, the pose of each person is estimated using the hourglass network. Both the calibration and estimation stages make use of the mask information.
Fig. 3 is an activation map in a detection frame in a detection stage provided by an embodiment of the present invention, and an activation map (b) is obtained by passing an original picture (a) through a convolutional neural network, where we can see that cluttered background information also generates a high activation value. Fig. 3(c) shows that when an accurate foreground mask is used, redundant information in the feature map will be removed.
Fig. 4 is a schematic diagram of different behaviors, which respectively shows 4 types of behaviors of a scaling behavior, a translation behavior, a termination behavior, and an aspect ratio adjustment behavior.
FIG. 5 is a schematic diagram of the deep Q network according to an embodiment of the present invention, comprising an 8 × 8 convolutional layer, a 4 × 4 convolutional layer, and a 3 × 3 convolutional layer connected in sequence. The output of the 3 × 3 convolutional layer feeds two branches: one branch obtains an 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, and the other branch obtains a state value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional-layer parameters, α and β are the fully-connected-layer parameters of the two branches, a is a detection-box adjustment behavior, and s is the input of the deep reinforcement learning network. The advantage function and the state value function are added to obtain the Q function, and the reward value of each behavior is computed through the Q function.
Fig. 6 is a schematic diagram of a posture estimation framework combining masks according to an embodiment of the present invention, where in the single posture estimation network, a human body binary mask obtained by a detection network is combined with a Cascaded Pyramid Network (CPN) to perform human body posture estimation, and a loss function of the single posture estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human body binary mask, and k is a scale factor that balances the two terms (k is set according to practical experience and is generally 1–5); L_mask = Σ_p L_p, where p indexes the human joint points and
L_p = (1 − m_l)·ŷ_p(l)
where ŷ_p(l) is the predicted value of joint p at position l in the activation map, l is the position with the maximum activation value in the activation map, and m_l is the human body binary mask value at position l, which is 1 inside the human body region and 0 in the background region. If the joint is not inside the human body region, the result is penalized; otherwise the loss function is not affected.
FIG. 7 shows the accuracy curves of the state and the reward function, where (a) is the training accuracy of the state, (b) is the testing accuracy of the state, (c) is the training accuracy of the reward function, and (d) is the testing accuracy of the reward function.
The experimental results on the MPII data set are shown in FIG. 8, where (a) shows successful predictions and (b) shows failed predictions. From the failure cases we can conclude that (1) although the detection stage is improved, the top-down method is still affected by early commitment, and (2) our method is best suited to situations where the people in the predicted area have little interaction.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.
Claims (10)
1. A multi-person posture estimation method based on mask perception deep reinforcement learning is characterized by comprising the following steps:
(1) constructing a multi-person attitude estimation model: the multi-character posture estimation model consists of a detection network for acquiring a detection frame and a mask, a deep reinforcement learning network for improving positioning accuracy and a single posture estimation network;
the detection network specifically comprises the following steps: obtaining a detection frame of an original image and a human body binary mask in the detection frame through a multitask learning network;
the deep reinforcement learning network specifically comprises the following steps: the human body binary mask obtained by the detection network is multiplied with the detection-box image resized to match the fully connected layer of the deep reinforcement learning network, and the product is taken as the input of the deep reinforcement learning network; the output of the deep reinforcement learning network is the reward value of 11 detection-box adjustment behaviors; the detection-box adjustment behaviors include four types: zooming behaviors, panning behaviors, a termination behavior, and aspect-ratio adjustment behaviors; the behavior with the maximum reward value is selected to adjust the detection box, the newly obtained detection-box image is iteratively fed into the deep reinforcement learning network until the behavior with the maximum reward value is the termination behavior, and the calibrated detection box is output;
the single-person posture estimation network specifically comprises the following steps: transmitting the mask and the calibrated detection frame image into a single posture estimation network to obtain a single posture;
(2) training a multi-person attitude estimation model by using a training sample; and inputting the image to be detected into the trained multi-character posture estimation model to obtain the character postures in all detection frames of the image to be detected.
2. The multi-character pose estimation method based on mask perception deep reinforcement learning of claim 1, wherein the detection network adopts a two-stage processing mode: in the first stage, a feature map of the original image is extracted with a deep residual network and a plurality of candidate boxes are generated by a region proposal network (RPN); and in the second stage, the candidate boxes are passed into three branches for multi-task learning, which respectively produce the classification confidence, the detection-box offsets, and the human body binary mask inside the detection box.
3. The multi-character pose estimation method based on mask perception deep reinforcement learning of claim 2, wherein each branch of the detection network at the second stage adopts the following joint loss function:
L = L_cls + α1·L_box + α2·L_mask
where L_cls is the classification loss, expressed with cross entropy; L_box is the localization loss, which measures the difference between the detection box and the ground-truth box with the L1 norm; L_mask is the segmentation loss, expressed with mean binary cross-entropy; and α1 and α2 are scale factors that balance the three losses.
4. The mask sensing depth reinforcement learning-based multi-person posture estimation method according to claim 1, wherein the deep reinforcement learning network comprises an 8 × 8 convolutional layer, a 4 × 4 convolutional layer, and a 3 × 3 convolutional layer connected in sequence; the output of the 3 × 3 convolutional layer has two branches, one branch obtaining an 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer and the other branch obtaining a state value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ denotes the shared convolutional-layer parameters of the deep reinforcement learning network, α and β are the fully-connected-layer parameters of the two branches, a is a detection-box adjustment behavior, and s is the input of the deep reinforcement learning network; and the advantage function and the state value function are added to obtain a Q function, and the reward value of each behavior is calculated through the Q function.
5. The mask aware depth-based reinforcement learning multi-character pose estimation method according to claim 4, wherein in the depth-based reinforcement learning network, the loss function loss (θ) of the Q function is expressed as follows:
loss(θ) = E[r + γ·max_a′ Q(s′, a′, θ) − Q(s, a, θ)]
where θ denotes the shared convolutional-layer parameters of the deep reinforcement learning network, s and a are respectively the input and the detection-box adjustment behavior of the current iteration, s′ and a′ are respectively the input and the detection-box adjustment behavior of the next iteration, Q(s, a, θ) is the sum of all reward values starting from the current iteration, Q(s′, a′, θ) is the sum of all reward values starting from the next iteration, r is the reward value of the current iteration, γ is the discount factor, and E denotes the expectation of the loss value over all iterations.
6. The mask-aware deep reinforcement learning-based multi-character pose estimation method according to claim 5, wherein in the deep reinforcement learning network, the reward value r of the current iteration is expressed as follows:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ·b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term and the second term λ·b′/B′ is a regularization term added to constrain the detection-box size; w_i and w′_i denote the detection boxes of object i before and after action a, respectively, g_i denotes the ground-truth box, b′ denotes the area of the intersection region between the detection box after action a and the ground-truth box, B′ denotes the area of the detection box after action a, IoU denotes the intersection-over-union, and λ is the scale factor that balances the reward term and the regularization term.
7. The method for estimating poses of multiple human beings based on mask perception deep reinforcement learning according to claim 5, characterized in that, in the deep reinforcement learning network, in order to improve the learning efficiency of the parameter θ, the following way is adopted:
(a) in order to improve the learning stability, a target network is introduced, the target network is separated from the online network, the online network is updated in each iteration, and the target network is updated at intervals;
(b) in order to avoid getting trapped in local minima, an ε-greedy strategy is adopted as the action policy;
(c) to address the data correlation problem, experience replay is used: transitions (s, a, r, s′) are stored in a buffer, and a fixed number of samples are randomly selected from the buffer during training to reduce the correlation between data.
8. The mask aware depth-based reinforcement learning multi-character pose estimation method as claimed in claim 1, wherein the deep reinforcement learning network employs a dueling DQN structure that can quickly identify the correct behavior during decision evaluation.
9. The mask-aware deep reinforcement learning-based multi-character posture estimation method as claimed in claim 1, wherein the single posture estimation network combines a human body binary mask obtained by the detection network with the cascaded pyramid network to perform human body posture estimation, and a loss function of the single posture estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human body binary mask, and k is a scale factor that balances the two terms; L_mask = Σ_p L_p, where p indexes the human joint points;
L_p = (1 − m_l)·ŷ_p(l)
where ŷ_p(l) is the predicted value of joint p at position l in the activation map, l is the position with the maximum activation value in the activation map, and m_l is the human body binary mask value at position l, which is 1 inside the human body region and 0 in the background region; if the joint is not inside the human body region, the result is penalized, otherwise the loss function is not affected.
10. The multi-character pose estimation method based on mask perception deep reinforcement learning of claim 1, wherein the multi-character pose estimation model training phase adopts a GPU for calculation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810968949.9A CN109190537B (en) | 2018-08-23 | 2018-08-23 | Mask perception depth reinforcement learning-based multi-person attitude estimation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810968949.9A CN109190537B (en) | 2018-08-23 | 2018-08-23 | Mask perception depth reinforcement learning-based multi-person attitude estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109190537A CN109190537A (en) | 2019-01-11 |
CN109190537B true CN109190537B (en) | 2020-09-29 |
Family
ID=64919381
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810968949.9A Active CN109190537B (en) | 2018-08-23 | 2018-08-23 | Mask perception depth reinforcement learning-based multi-person attitude estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109190537B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109766887B (en) * | 2019-01-16 | 2022-11-11 | 中国科学院光电技术研究所 | Multi-target detection method based on cascaded hourglass neural network |
CN109784296A (en) * | 2019-01-27 | 2019-05-21 | 武汉星巡智能科技有限公司 | Bus occupant quantity statistics method, device and computer readable storage medium |
CN110008915B (en) * | 2019-04-11 | 2023-02-03 | 电子科技大学 | System and method for estimating dense human body posture based on mask-RCNN |
CN110188219B (en) * | 2019-05-16 | 2023-01-06 | 复旦大学 | Depth-enhanced redundancy-removing hash method for image retrieval |
CN110222636B (en) * | 2019-05-31 | 2023-04-07 | 中国民航大学 | Pedestrian attribute identification method based on background suppression |
CN110210402B (en) * | 2019-06-03 | 2021-11-19 | 北京卡路里信息技术有限公司 | Feature extraction method and device, terminal equipment and storage medium |
CN110197163B (en) * | 2019-06-04 | 2021-02-12 | 中国矿业大学 | Target tracking sample expansion method based on pedestrian search |
CN110415332A (en) * | 2019-06-21 | 2019-11-05 | 上海工程技术大学 | Complex textile surface three dimensional reconstruction system and method under a kind of non-single visual angle |
CN110335291A (en) * | 2019-07-01 | 2019-10-15 | 腾讯科技(深圳)有限公司 | Personage's method for tracing and terminal |
CN112184802B (en) * | 2019-07-05 | 2023-10-20 | 杭州海康威视数字技术股份有限公司 | Calibration frame adjusting method, device and storage medium |
CN112241976B (en) * | 2019-07-19 | 2024-08-27 | 杭州海康威视数字技术股份有限公司 | Model training method and device |
CN110569719B (en) * | 2019-07-30 | 2022-05-17 | 中国科学技术大学 | Animal head posture estimation method and system |
CN110866872B (en) * | 2019-10-10 | 2022-07-29 | 北京邮电大学 | Pavement crack image preprocessing intelligent selection method and device and electronic equipment |
CN111415389B (en) * | 2020-03-18 | 2023-08-29 | 清华大学 | Label-free six-dimensional object posture prediction method and device based on reinforcement learning |
CN111738091A (en) * | 2020-05-27 | 2020-10-02 | 复旦大学 | Posture estimation and human body analysis system based on multi-task deep learning |
CN111695457B (en) * | 2020-05-28 | 2023-05-09 | 浙江工商大学 | Human body posture estimation method based on weak supervision mechanism |
CN112052886B (en) * | 2020-08-21 | 2022-06-03 | 暨南大学 | Intelligent human body action posture estimation method and device based on convolutional neural network |
CN112884780A (en) * | 2021-02-06 | 2021-06-01 | 罗普特科技集团股份有限公司 | Estimation method and system for human body posture |
CN113012229A (en) * | 2021-03-26 | 2021-06-22 | 北京华捷艾米科技有限公司 | Method and device for positioning human body joint points |
CN113361570B (en) * | 2021-05-25 | 2022-11-01 | 东南大学 | 3D human body posture estimation method based on joint data enhancement and network training model |
CN113436633B (en) * | 2021-06-30 | 2024-03-12 | 平安科技(深圳)有限公司 | Speaker recognition method, speaker recognition device, computer equipment and storage medium |
CN113537070B (en) * | 2021-07-19 | 2022-11-22 | 中国第一汽车股份有限公司 | Detection method, detection device, electronic equipment and storage medium |
CN114143710B (en) * | 2021-11-22 | 2022-10-04 | 武汉大学 | Wireless positioning method and system based on reinforcement learning |
CN116721471A (en) * | 2023-08-10 | 2023-09-08 | 中国科学院合肥物质科学研究院 | Multi-person three-dimensional attitude estimation method based on multi-view angles |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11004358B2 (en) * | 2015-04-22 | 2021-05-11 | Jeffrey B. Matthews | Visual and kinesthetic method and educational kit for solving algebraic linear equations involving an unknown variable |
US10607342B2 (en) * | 2016-09-30 | 2020-03-31 | Siemenes Healthcare GmbH | Atlas-based contouring of organs at risk for radiation therapy |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150544A (en) * | 2011-08-30 | 2013-06-12 | 精工爱普生株式会社 | Method and apparatus for object pose estimation |
CN106780569A (en) * | 2016-11-18 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of human body attitude estimates behavior analysis method |
CN106780536A (en) * | 2017-01-13 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of shape based on object mask network perceives example dividing method |
CN106897697A (en) * | 2017-02-24 | 2017-06-27 | 深圳市唯特视科技有限公司 | A kind of personage and pose detection method based on visualization compiler |
CN106951512A (en) * | 2017-03-17 | 2017-07-14 | 深圳市唯特视科技有限公司 | A kind of end-to-end session control method based on hybrid coding network |
CN107392118A (en) * | 2017-07-04 | 2017-11-24 | 竹间智能科技(上海)有限公司 | The recognition methods of reinforcing face character and the system of generation network are resisted based on multitask |
CN107944443A (en) * | 2017-11-16 | 2018-04-20 | 深圳市唯特视科技有限公司 | One kind carries out object consistency detection method based on end-to-end deep learning |
CN108256489A (en) * | 2018-01-24 | 2018-07-06 | 清华大学 | Behavior prediction method and device based on deeply study |
CN108304795A (en) * | 2018-01-29 | 2018-07-20 | 清华大学 | Human skeleton Activity recognition method and device based on deeply study |
Non-Patent Citations (4)
Title |
---|
Canonical Locality Preserving Latent Variable Model for Discriminative Pose Inference;Yan Tian 等;《Image and Vision Computing》;20130331;第223-230页 * |
DNN-BASED SOURCE ENHANCEMENT SELF-OPTIMIZED BY REINFORCEMENT LEARNING USING SOUND QUALITY MEASUREMENTS;Koizumi, Yuma 等;《2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》;20170509;第81-85页 * |
Human Pose Estimation in Images and Videos Based on Part Detectors; Su Yanchao et al.; Journal of Electronics & Information Technology; 2011-06-30; Vol. 33, No. 6; pp. 1413-1419 *
A Survey of Object Tracking Algorithms; Lu Huchuan et al.; Pattern Recognition and Artificial Intelligence; 2018-01-31; Vol. 31, No. 1; pp. 61-76 *
Also Published As
Publication number | Publication date |
---|---|
CN109190537A (en) | 2019-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109190537B (en) | Mask perception depth reinforcement learning-based multi-person attitude estimation method | |
US11468697B2 (en) | Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN107529650B (en) | Closed loop detection method and device and computer equipment | |
US10957309B2 (en) | Neural network method and apparatus | |
CN106570464B (en) | Face recognition method and device for rapidly processing face shielding | |
CN111310672A (en) | Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling | |
CN113570029A (en) | Method for obtaining neural network model, image processing method and device | |
CN113705769A (en) | Neural network training method and device | |
CN108182260B (en) | Multivariate time sequence classification method based on semantic selection | |
CN104504366A (en) | System and method for smiling face recognition based on optical flow features | |
CN110716792B (en) | Target detector and construction method and application thereof | |
CN111598183A (en) | Multi-feature fusion image description method | |
CN113592060A (en) | Neural network optimization method and device | |
CN113407820B (en) | Method for processing data by using model, related system and storage medium | |
CN115187786A (en) | Rotation-based CenterNet2 target detection method | |
CN113781517A (en) | System and method for motion estimation | |
CN113326735A (en) | Multi-mode small target detection method based on YOLOv5 | |
CN114140885A (en) | Emotion analysis model generation method and device, electronic equipment and storage medium | |
CN109345559B (en) | Moving target tracking method based on sample expansion and depth classification network | |
US20050114103A1 (en) | System and method for sequential kernel density approximation through mode propagation | |
CN116486233A (en) | Target detection method for multispectral double-flow network | |
CN112989952B (en) | Crowd density estimation method and device based on mask guidance | |
KR20230141828A (en) | Neural networks using adaptive gradient clipping | |
CN112053386B (en) | Target tracking method based on depth convolution characteristic self-adaptive integration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20210929 Address after: 310000 Room 401, building 2, No.16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province Patentee after: Hangzhou yunqi smart Vision Technology Co., Ltd Address before: 310018, No. 18 Jiao Tong Street, Xiasha Higher Education Park, Hangzhou, Zhejiang Patentee before: ZHEJIANG GONGSHANG University |
|
TR01 | Transfer of patent right |