
CN109190537B - Multi-person pose estimation method based on mask-aware deep reinforcement learning - Google Patents


Info

Publication number
CN109190537B
CN109190537B CN201810968949.9A CN201810968949A
Authority
CN
China
Prior art keywords
network
reinforcement learning
mask
detection
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810968949.9A
Other languages
Chinese (zh)
Other versions
CN109190537A (en)
Inventor
田彦 (Tian Yan)
王勋 (Wang Xun)
吴佳辰 (Wu Jiachen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yunqi Smart Vision Technology Co Ltd
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN201810968949.9A priority Critical patent/CN109190537B/en
Publication of CN109190537A publication Critical patent/CN109190537A/en
Application granted granted Critical
Publication of CN109190537B publication Critical patent/CN109190537B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-person pose estimation method based on mask-aware deep reinforcement learning. A multi-person pose estimation model is first constructed from three sub-networks: a detection network that produces detection boxes and masks, a deep reinforcement learning network that improves localization accuracy, and a single-person pose estimation network. The model is then trained on training samples; at test time, the image to be analyzed is fed into the trained multi-person pose estimation model to obtain the poses of the persons in all detection boxes of the image. The method introduces mask information into both the deep reinforcement learning network and the single-person pose estimation network, improving both stages, and mitigates vanishing and exploding gradients by introducing a residual structure. Compared with other state-of-the-art multi-person pose estimation methods, the disclosed method is more competitive.

Description

Multi-person pose estimation method based on mask-aware deep reinforcement learning
Technical Field
The invention relates to human body pose estimation technology, and in particular to a multi-person pose estimation method based on mask-aware deep reinforcement learning.
Background
With the wide deployment of multimedia sensors, motion capture applications such as fashion design, clinical analysis, human-computer interaction, behavior recognition, and motion rehabilitation have made human pose estimation a hot topic in the multimedia industry.
Recently, single-person pose estimation has made significant progress by adopting deep-learning-based architectures. However, multi-person pose estimation, i.e. determining the poses of multiple people in an image, especially estimating individuals in a crowd, is still a difficult task. The main difficulties are as follows. First, the number of people in an image is unknown, and people may appear anywhere in the image and at any scale. Second, there are interactions between people in the image, such as occlusion, communication, and touch, which make the estimation more difficult. Third, as the number of people in an image increases, the computational complexity increases, which makes designing efficient algorithms a challenge. These difficulties are illustrated in Fig. 1(a)-(d).
Top-down and bottom-up are the main approaches to human pose estimation. The top-down approach detects each person with a single-person detector and evaluates each detection with a single-person pose estimator. However, when people are too close together, the single-person detector fails, and the computational complexity grows with the number of people in the picture. The bottom-up approach works in the opposite direction: joint points are detected first, and the pose of each person is then assembled using local context information; this approach cannot guarantee efficiency because the final parsing requires global information.
Because the detection results of both the top-down and bottom-up methods are imprecisely localized, the accuracy of multi-person pose estimation is further reduced. Fig. 1(e) shows the relationship between the detection result and the human pose estimation result. In object detection, a deep-learning-based detection box is considered correct when its intersection-over-union with the ground-truth box exceeds 0.5; however, such loosely localized detection results are unfavorable for human pose estimation. We therefore need to correct the detection box starting from the original detection result. Deep reinforcement learning is an effective way to select the best action for reaching an optimal value according to environmental information.
Disclosure of Invention
The invention aims to provide a multi-person posture estimation method based on mask perception deep reinforcement learning, aiming at the defects of the prior art.
The purpose of the invention is realized by the following technical scheme: a multi-person pose estimation method based on mask-aware deep reinforcement learning, comprising the following steps:
(1) constructing a multi-person pose estimation model: the model consists of a detection network for acquiring detection boxes and masks, a deep reinforcement learning network for improving localization accuracy, and a single-person pose estimation network;
(1.1) detection network: a detection box of the original image and the human-body binary mask within the detection box are obtained through a multitask learning network;
(1.2) deep reinforcement learning network: used to calibrate the localization result; unlike conventional sampling-based calibration, the calibration is expressed as a Markov decision process, and the detection box is updated by a recursive reward-or-penalty learning process; the purpose of this part is to learn an optimal policy function that maps a state S to an action A;
in computer vision, most deep reinforcement learning methods use a feature map as the state vector; however, a cluttered background produces high activation values in the feature map, which interfere with the calibration result and thus degrade the human pose estimation process; in the present invention, the environment state is defined as a tuple (h, i), where h is the historical decision vector from the decision network and i is the masked feature map; a pre-trained convolutional neural network model f1 extracts an original feature map from the image x, which is then passed into a multitask network f2 acting as an attention map, yielding the masked feature map; the expression for i is as follows:
i = f2(f1(x)) ⊙ f1(x)
where ⊙ denotes the Hadamard product.
When an accurate foreground mask is used, redundant information in the feature map is removed; the masked feature map provides low-level information such as shape and outline as well as high-level information such as the pose of the human body, which facilitates the calibration process.
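To make the state construction concrete, here is a minimal PyTorch-style sketch of the masked feature map i = f2(f1(x)) ⊙ f1(x). The layer shapes of f1 and f2 are illustrative assumptions; the text specifies only that f1 is a pre-trained CNN and f2 a multitask network acting as an attention map.

import torch
import torch.nn as nn

# Stand-in backbone f1 and mask head f2; the exact layers are assumptions.
f1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # f1: image -> feature map
f2 = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Sigmoid())          # f2: feature map -> soft foreground mask

def masked_feature_map(x: torch.Tensor) -> torch.Tensor:
    """i = f2(f1(x)) ⊙ f1(x): suppress activations outside the person mask."""
    feats = f1(x)        # original feature map f1(x)
    mask = f2(feats)     # attention/mask map with values in [0, 1]
    return mask * feats  # Hadamard product (broadcast over channels)

x = torch.randn(1, 3, 224, 224)  # dummy image batch
i = masked_feature_map(x)        # masked feature map used as part of the RL state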
The human-body binary mask obtained by the detection network is multiplied with the detection-box image, which is resized to match the fully connected layer of the deep reinforcement learning network; the product serves as the input of the deep reinforcement learning network. The output of the deep reinforcement learning network is the reward value of each of 11 detection-box adjustment actions.
The detection-box adjustment actions comprise four types: scaling actions (zooming out and zooming in), translation actions (moving up, down, left, and right), a termination action (deciding whether to stop adjusting the box), and aspect-ratio adjustment actions (increase and decrease in the width direction and in the height direction). So that the detector produces relatively stable results, each action moves the window by 0.1 times the current window size.
The action with the maximum reward value is selected to adjust the detection box, and the newly obtained detection-box image is iteratively fed into the deep reinforcement learning network until the action with the maximum reward value is the termination action, whereupon the calibrated detection box is output. A minimal sketch of this loop is given below.
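The following sketch assumes a box represented as (x, y, w, h), one possible ordering of the 11 actions, and a q_values_fn callable standing in for the deep reinforcement learning network; all of these are assumptions beyond the text, while the 0.1 step size follows it.

import numpy as np

# Assumed action ordering: 0-1 scale, 2-5 translate, 6-9 aspect ratio, 10 terminate
TERMINATE = 10

def apply_action(box, action, step=0.1):
    """Adjust a (x, y, w, h) box by `step` times the current window size."""
    x, y, w, h = box
    dx, dy = step * w, step * h
    moves = {
        0: (x + dx / 2, y + dy / 2, w - dx, h - dy),  # zoom out (shrink around center)
        1: (x - dx / 2, y - dy / 2, w + dx, h + dy),  # zoom in (grow around center)
        2: (x, y - dy, w, h),                         # move up
        3: (x, y + dy, w, h),                         # move down
        4: (x - dx, y, w, h),                         # move left
        5: (x + dx, y, w, h),                         # move right
        6: (x, y, w + dx, h),                         # wider
        7: (x, y, w - dx, h),                         # narrower
        8: (x, y, w, h + dy),                         # taller
        9: (x, y, w, h - dy),                         # shorter
    }
    return moves[action]

def calibrate_box(box, q_values_fn, max_steps=20):
    """Greedy calibration loop: take the highest-reward action until termination."""
    for _ in range(max_steps):
        q = q_values_fn(box)  # 11 reward values from the deep RL network
        a = int(np.argmax(q))
        if a == TERMINATE:
            break
        box = apply_action(box, a)
    return box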
(1.3) single-person pose estimation network: the mask and the calibrated detection-box image are passed into a single-person pose estimation network to obtain the single-person pose;
(2) training the multi-person pose estimation model with training samples; the image to be analyzed is input into the trained multi-person pose estimation model to obtain the poses of the persons in all detection boxes of the image.
Further, the detection network adopts a two-stage processing mode: in the first stage, a deep residual network extracts a feature map of the original image and a region proposal network (RPN) generates a number of candidate boxes; in the second stage, the candidate boxes are fed into three branches for multitask learning, yielding the classification confidence, the detection-box offsets, and the human-body binary mask within the detection box.
Further, the three branches of the second stage of the detection network are trained with the following joint loss function:
L = L_cls + α1·L_box + α2·L_mask
where L_cls is the classification loss, expressed as cross entropy; L_box is the localization loss, measuring the difference between the detection box and the ground-truth box with the L1 norm; L_mask is the segmentation loss, expressed as mean binary cross entropy; and α1 and α2 are scale factors balancing the three losses.
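As an illustration, the joint loss could be assembled as follows in PyTorch; the tensor shapes and per-branch reductions are assumptions, with α1 = 4.0 and α2 = 10.0 taken from the training details given later.

import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets,
                   box_preds, box_targets,
                   mask_logits, mask_targets,
                   alpha1=4.0, alpha2=10.0):
    """L = L_cls + α1·L_box + α2·L_mask (α1 = 4.0, α2 = 10.0 per the training details)."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)    # classification: cross entropy
    l_box = F.l1_loss(box_preds, box_targets)           # localization: L1 between boxes
    l_mask = F.binary_cross_entropy_with_logits(        # segmentation: mean binary cross entropy
        mask_logits, mask_targets)
    return l_cls + alpha1 * l_box + alpha2 * l_mask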
Furthermore, the deep reinforcement learning network comprises an 8×8 convolutional layer, a 4×4 convolutional layer, and a 3×3 convolutional layer connected in sequence. The output of the 3×3 convolutional layer feeds two branches: one branch passes through an 11-dimensional fully connected layer to obtain an 11-dimensional advantage function A(a, s; θ, α), and the other through a 1-dimensional fully connected layer to obtain a state value function V(s; θ, β), where θ denotes the shared convolutional-layer parameters, α and β the respective fully-connected-layer parameters of the two branches, a the detection-box adjustment action, and s the input of the deep reinforcement learning network. The advantage function and the state value function are summed to obtain the Q function, Q(s, a; θ, α, β) = V(s; θ, β) + A(a, s; θ, α), through which the reward value of each action is calculated.
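A sketch of this dueling head in PyTorch: the kernel sizes and the 11/1-dimensional branches follow the text, while the channel counts, strides, and input resolution are assumptions. Note that the text states the plain sum Q = V + A, whereas common dueling-DQN practice additionally subtracts the mean advantage.

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(a, s) with the conv stack described above."""
    def __init__(self, in_ch=3, n_actions=11):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the FC input size from an assumed 84x84 input resolution.
        feat = self.conv(torch.zeros(1, in_ch, 84, 84)).shape[1]
        self.advantage = nn.Linear(feat, n_actions)  # A(a, s; θ, α): 11-dimensional branch
        self.value = nn.Linear(feat, 1)              # V(s; θ, β): 1-dimensional branch

    def forward(self, s):
        h = self.conv(s)
        return self.value(h) + self.advantage(h)     # Q = V + A, as stated in the text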
Further, in the deep reinforcement learning network, the reward value r of the current iteration is expressed as follows:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ·b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term and the second term λ·b′/B′ is a regularization term added to constrain the detection-box size; w_i and w′_i denote the detection boxes of object i before and after action a, respectively; g_i denotes the ground-truth box; b′ denotes the area of the intersection between the post-action detection box and the ground-truth box; B′ denotes the area of the post-action detection box; IoU denotes intersection over union; and λ is a scale factor balancing the reward term and the regularization term (its specific value is tuned experimentally, typically 1-10).
The termination action is an additional action that does not move the detection box; it only decides whether the optimal result of the reinforcement learning search has been found. Its reward value is defined as follows:
r_terminate = +η if IoU(w′_i, g_i) ≥ τ, and −η otherwise
where τ is the IoU threshold determining whether the reward is positive or negative, and η is the corresponding reward magnitude.
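The two reward definitions can be sketched as follows, with boxes given as (x1, y1, x2, y2); the default values of λ, τ, and η are placeholders within the ranges discussed in the text.

def box_area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def step_reward(box_before, box_after, gt, lam=1.0):
    """r(s, a) = (IoU(w', g) - IoU(w, g)) + λ·b'/B' for a non-terminating action."""
    ix1, iy1 = max(box_after[0], gt[0]), max(box_after[1], gt[1])
    ix2, iy2 = min(box_after[2], gt[2]), min(box_after[3], gt[3])
    b_prime = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # intersection area b'
    B_prime = box_area(box_after)                         # post-action box area B'
    return (iou(box_after, gt) - iou(box_before, gt)) + lam * b_prime / B_prime

def terminate_reward(box, gt, tau=0.6, eta=3.0):
    """+η when IoU ≥ τ, −η otherwise (τ and η values are placeholders)."""
    return eta if iou(box, gt) >= tau else -eta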
The action a is selected according to the Q function, which expresses the accumulation of current and future reward values:
a = argmax_a Q(s, a)
Q(s, a) = r + γ·max_a′ Q(s′, a′)
The loss function loss(θ) for training the Q function is expressed as follows:
loss(θ) = E[(r + γ·max_a′ Q(s′, a′; θ) − Q(s, a; θ))²]
where θ denotes the parameters of the deep reinforcement learning network; s and a are the input and the detection-box adjustment action of the current iteration, and s′ and a′ those of the next iteration; Q(s, a; θ) is the cumulative reward from the current iteration onward and Q(s′, a′; θ) the cumulative reward from the next iteration onward; r is the reward value of the current iteration; γ is the discount factor; and E denotes the expectation of the loss over all iterations.
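A sketch of this loss for a sampled minibatch, assuming a `done` flag marks transitions that ended with the termination action (a detail the text leaves implicit), and a separate target network as described below:

import torch

def dqn_loss(q_online, q_target, batch, gamma=0.9):
    """Squared TD error E[(r + γ·max_a' Q_target(s', a') − Q(s, a))²]."""
    s, a, r, s_next, done = batch
    q_sa = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; θ)
    with torch.no_grad():                                    # target values are held fixed
        q_next = q_target(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    return ((target - q_sa) ** 2).mean()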
Further, in the deep reinforcement learning network, the following measures are adopted to improve the learning efficiency of the parameter θ (a sketch follows the list):
(a) in order to improve the learning stability, a target network is introduced, the target network is separated from the online network, the online network is updated in each iteration, and the target network is updated at intervals;
(b) to avoid getting trapped in local minima, an ε-greedy strategy is adopted as the action policy;
(c) to address the data-correlation problem, experience replay is used: transitions (s, a, r, s′) are stored in a buffer, from which a fixed number of samples are drawn at random during training to reduce the correlation between data.
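A minimal sketch of these three mechanisms; the buffer capacity of 10,000 and batch size of 32 follow the training details given later, while the target-network synchronization period is an assumption.

import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store (s, a, r, s') transitions and sample
    minibatches at random to break the correlation between consecutive data."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size=32):
        return random.sample(list(self.buf), batch_size)

def epsilon_greedy(q_values, epsilon):
    """With probability ε explore a random action, otherwise exploit argmax Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Target-network update every `sync_every` iterations (period is an assumption):
#     target_net.load_state_dict(online_net.state_dict())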
Further, the deep reinforcement learning network adopts a dueling DQN structure, which can quickly identify the correct action during policy evaluation. Like a standard Q-network, the dueling structure is trained with back-propagation alone, requiring no extra supervision or algorithmic modification to automatically estimate V(s; θ, β) and A(a, s; θ, α).
Further, the single-person pose estimation network combines the human-body binary mask obtained by the detection network with a cascaded pyramid network (CPN) to perform human pose estimation; the loss function of the single-person pose estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human-body binary mask, and k is a scale factor balancing the two terms (chosen empirically, typically 1-5); L_mask = Σ_p L_p, where p indexes the human joint points and
L_p = (1 − m_l)·ŝ_l^p
where ŝ_l^p denotes the predicted value of joint p at position l in the activation map, l is the position of the maximum activation value in the activation map, and m_l is the human-body binary mask at position l (1 inside the human-body region, 0 in the background). A joint whose peak falls outside the human-body region is thus penalized; otherwise the loss function is unaffected.
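A sketch of this mask-aware loss, assuming heatmap-style joint predictions of shape (batch, joints, H, W), a mean-squared-error form for L_inf (the text does not pin down its exact form), and k = 0.4 from the training details:

import torch

def mask_regularizer(heatmaps, body_mask):
    """L_mask = Σ_p (1 − m_l)·ŝ_l^p: penalize each joint whose peak
    activation position l falls outside the human-body binary mask."""
    B, P, H, W = heatmaps.shape                   # one activation map per joint p
    flat = heatmaps.view(B, P, -1)
    peak_vals, peak_idx = flat.max(dim=2)         # ŝ_l^p and l = argmax position
    m_l = body_mask.view(B, 1, -1).expand(B, P, H * W) \
                   .gather(2, peak_idx.unsqueeze(2)).squeeze(2)
    return ((1.0 - m_l) * peak_vals).sum(dim=1).mean()

def pose_loss(pred_heatmaps, gt_heatmaps, body_mask, k=0.4):
    """L = L_inf + k·L_mask (k = 0.4 per the training details)."""
    l_inf = ((pred_heatmaps - gt_heatmaps) ** 2).mean()  # assumed MSE form of L_inf
    return l_inf + k * mask_regularizer(pred_heatmaps, body_mask)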
Further, the training stage of the multi-person pose estimation model is computed on a GPU.
Preferably, the training details of the detection network are as follows: α1 and α2 in the loss function are set to 4.0 and 10.0, respectively. The whole network is trained with stochastic gradient descent with momentum 0.9 and weight decay 0.0005. The learning rate is 0.01 for the first 60,000 iterations and 0.001 for the final 20,000 iterations. In each batch, 48 positive samples are taken from 4 training pictures and 48 negative samples from cluttered backgrounds. In the verification phase, the confidence threshold is set to 0.7 and the IoU threshold for localization to 0.6.
In the calibration process based on deep reinforcement learning, the experience replay buffer holds 10,000 transitions and the batch size is 32. λ in the reward function is set between 1 and 10. An ε-greedy strategy is used; during training, ε is decayed from 0.3 to 0.05 after 5,000 training steps. The discount factor γ is 0.9.
In the single-person pose estimation stage, k in the loss function is set to 0.4. The model is trained with stochastic gradient descent at an initial learning rate of 0.0005, halved after every 10 traversals of the dataset. The weight decay is 0.00005, and batch normalization is used. These settings are collected below.
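For reference, the hyperparameters listed above can be gathered into a single configuration; the key names are illustrative, only the values come from the text.

# Consolidated hyperparameters from the training details above (names are illustrative).
TRAIN_CONFIG = {
    "detection": {
        "alpha1": 4.0, "alpha2": 10.0,
        "optimizer": "SGD", "momentum": 0.9, "weight_decay": 5e-4,
        "lr_schedule": {0: 0.01, 60_000: 0.001},   # 80,000 iterations total
        "pos_per_batch": 48, "neg_per_batch": 48, "images_per_batch": 4,
        "conf_threshold": 0.7, "iou_threshold": 0.6,
    },
    "reinforcement_learning": {
        "replay_capacity": 10_000, "batch_size": 32,
        "lambda_range": (1, 10),                   # reward/regularizer balance
        "epsilon_start": 0.3, "epsilon_end": 0.05, # ε decayed after 5,000 steps
        "gamma": 0.9,
    },
    "pose_estimation": {
        "k": 0.4, "optimizer": "SGD", "lr": 5e-4,
        "lr_decay": "halve every 10 dataset traversals",
        "weight_decay": 5e-5, "batch_norm": True,
    },
}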
Compared with the prior art, the invention has the beneficial effects that:
1. The proposed multi-person pose estimation method based on mask-aware deep reinforcement learning increases detection accuracy.
2. Mask information is used to eliminate the negative effect of cluttered background information and to select the best action according to the reward function.
3. In the human pose estimation stage, a regularization term is added to penalize joints that fall outside the human-body contour.
4. Tested on the MPII test set, the multi-person pose estimation model improves mean average precision (mAP) by 1.1 over prior models; it reaches 73.0 mAP on the MS-COCO test-dev set.
Drawings
FIG. 1 illustrates the difficulties of multi-person pose estimation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the multi-person pose estimation framework based on mask-aware deep reinforcement learning according to an embodiment of the present invention;
FIG. 3 shows activation maps within a detection box in the detection stage provided by an embodiment of the invention;
FIG. 4 is a schematic representation of different behaviors;
FIG. 5 is a schematic diagram of a deep Q network provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a mask-aware pose estimation framework provided by an embodiment of the present invention;
FIG. 7 is a graph of the accuracy of the state versus reward function provided by an embodiment of the present invention;
FIG. 8 is a test result in an MPII dataset provided by an embodiment of the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
The multi-person pose estimation method provided by this embodiment obtains the position and pose information of an arbitrary number of people in an image, and can be applied in multimedia industries such as clinical analysis, human-computer interaction, and behavior recognition.
The method obtains localization detection boxes and masks with a multitask learning network, calibrates the localization with a deep reinforcement learning network, and finally performs human pose estimation on the person in each detection box with the single-person pose estimation network. Embodiments of the invention are described below with reference to the drawings.
Fig. 1 illustrates the difficulties of multi-person pose estimation according to an embodiment of the present invention: (a) shows that the number and positions of people in a picture are unknown; (b), (c), and (d) show occlusion, communication, and contact, respectively, illustrating the interactions between people; (e) shows the relationship between detection-box detection and human pose estimation.
Fig. 2 is a schematic diagram of the multi-person pose estimation framework based on mask-aware deep reinforcement learning according to an embodiment of the present invention: a multitask network simultaneously obtains detection boxes and person masks, and the localization result is calibrated with the deep reinforcement learning network. Finally, the pose of each person is estimated with an hourglass network. Both the calibration and estimation stages make use of the mask information.
Fig. 3 shows activation maps within a detection box in the detection stage provided by an embodiment of the present invention: the activation map (b) is obtained by passing the original picture (a) through a convolutional neural network, where it can be seen that cluttered background information also produces high activation values. Fig. 3(c) shows that when an accurate foreground mask is used, the redundant information in the feature map is removed.
Fig. 4 is a schematic diagram of different behaviors, which respectively shows 4 types of behaviors of a scaling behavior, a translation behavior, a termination behavior, and an aspect ratio adjustment behavior.
FIG. 5 is a schematic diagram of the deep Q network according to an embodiment of the present invention, comprising an 8×8 convolutional layer, a 4×4 convolutional layer, and a 3×3 convolutional layer connected in sequence. The output of the 3×3 convolutional layer feeds two branches: one branch passes through an 11-dimensional fully connected layer to obtain the 11-dimensional advantage function A(a, s; θ, α), and the other through a 1-dimensional fully connected layer to obtain the state value function V(s; θ, β), where θ denotes the shared convolutional-layer parameters, α and β the respective fully-connected-layer parameters of the two branches, a the detection-box adjustment action, and s the input of the deep reinforcement learning network. The advantage function and the state value function are summed to obtain the Q function, through which the reward value of each action is calculated.
Fig. 6 is a schematic diagram of the mask-aware pose estimation framework provided by an embodiment of the present invention. In the single-person pose estimation network, the human-body binary mask obtained by the detection network is combined with a cascaded pyramid network (CPN) to perform human pose estimation; the loss function of the single-person pose estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human-body binary mask, and k is a scale factor balancing the two terms (chosen empirically, typically 1-5); L_mask = Σ_p L_p, where p indexes the human joint points and
L_p = (1 − m_l)·ŝ_l^p
where ŝ_l^p denotes the predicted value of joint p at position l in the activation map, l is the position of the maximum activation value in the activation map, and m_l is the human-body binary mask at position l (1 inside the human-body region, 0 in the background). A joint whose peak falls outside the human-body region is penalized; otherwise the loss function is unaffected.
FIG. 7 shows accuracy curves for the state and the reward function, where (a) is the training accuracy for the state, (b) the testing accuracy for the state, (c) the training accuracy for the reward function, and (d) the testing accuracy for the reward function.
The experimental results on the MPII dataset are shown in Fig. 8, where (a) shows successful predictions and (b) shows failed predictions. From the failure cases we can conclude that (1) although the detection stage is improved, the top-down method is still affected by early commitment, and (2) our method is best suited to scenes where the people in the predicted region interact relatively little.
The above-described embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments and do not limit the invention; any modifications, additions, or equivalents made within the scope of the principles of the present invention shall fall within the protection scope of the invention.

Claims (10)

1. A multi-person pose estimation method based on mask-aware deep reinforcement learning, characterized by comprising the following steps:
(1) constructing a multi-person pose estimation model: the model consists of a detection network for acquiring detection boxes and masks, a deep reinforcement learning network for improving localization accuracy, and a single-person pose estimation network;
the detection network is specifically as follows: a detection box of the original image and the human-body binary mask within the detection box are obtained through a multitask learning network;
the deep reinforcement learning network is specifically as follows: the human-body binary mask obtained by the detection network is multiplied with the detection-box image resized to match the fully connected layer of the deep reinforcement learning network, and the product is taken as the input of the deep reinforcement learning network; the output of the deep reinforcement learning network is the reward value of each of 11 detection-box adjustment actions; the detection-box adjustment actions comprise four types: scaling actions, translation actions, a termination action, and aspect-ratio adjustment actions; the action with the maximum reward value is selected to adjust the detection box, and the newly obtained detection-box image is iteratively fed into the deep reinforcement learning network until the action with the maximum reward value is the termination action, whereupon the calibrated detection box is output;
the single-person pose estimation network is specifically as follows: the mask and the calibrated detection-box image are passed into the single-person pose estimation network to obtain the single-person pose;
(2) training the multi-person pose estimation model with training samples; the image to be analyzed is input into the trained multi-person pose estimation model to obtain the poses of the persons in all detection boxes of the image.
2. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the detection network adopts a two-stage processing mode: in the first stage, a deep residual network extracts a feature map of the original image and a region proposal network (RPN) generates a number of candidate boxes; in the second stage, the candidate boxes are fed into three branches for multitask learning, yielding the classification confidence, the detection-box offsets, and the human-body binary mask within the detection box.
3. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 2, characterized in that the branches of the second stage of the detection network are trained with the following joint loss function:
L = L_cls + α1·L_box + α2·L_mask
where L_cls is the classification loss, expressed as cross entropy; L_box is the localization loss, measuring the difference between the detection box and the ground-truth box with the L1 norm; L_mask is the segmentation loss, expressed as mean binary cross entropy; and α1 and α2 are scale factors balancing the three losses.
4. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the deep reinforcement learning network comprises an 8×8 convolutional layer, a 4×4 convolutional layer, and a 3×3 convolutional layer connected in sequence; the output of the 3×3 convolutional layer has two branches, one branch obtaining an 11-dimensional advantage function A(a, s; θ, α) through an 11-dimensional fully connected layer, the other obtaining a state value function V(s; θ, β) through a 1-dimensional fully connected layer, where θ is the shared convolutional-layer parameter of the deep reinforcement learning network, α and β are the respective fully-connected-layer parameters of the two branches, a is the detection-box adjustment action, and s is the input of the deep reinforcement learning network; the advantage function and the state value function are summed to obtain the Q function, through which the reward value of each action is calculated.
5. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 4, characterized in that, in the deep reinforcement learning network, the loss function loss(θ) of the Q function is expressed as follows:
loss(θ) = E[(r + γ·max_a′ Q(s′, a′; θ) − Q(s, a; θ))²]
where θ is the shared convolutional-layer parameter of the deep reinforcement learning network; s and a are the input and detection-box adjustment action of the current iteration, and s′ and a′ those of the next iteration; Q(s, a; θ) is the cumulative reward from the current iteration onward and Q(s′, a′; θ) the cumulative reward from the next iteration onward; r is the reward value of the current iteration; γ is the discount factor; and E is the expectation of the loss over all iterations.
6. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 5, characterized in that, in the deep reinforcement learning network, the reward value r of the current iteration is expressed as follows:
r(s, a) = (IoU(w′_i, g_i) − IoU(w_i, g_i)) + λ·b′/B′
where the first term IoU(w′_i, g_i) − IoU(w_i, g_i) is the conventional reward term and the second term λ·b′/B′ is a regularization term added to constrain the detection-box size; w_i and w′_i denote the detection boxes of object i before and after action a, respectively; g_i denotes the ground-truth box; b′ denotes the area of the intersection between the post-action detection box and the ground-truth box; B′ denotes the area of the post-action detection box; IoU denotes intersection over union; and λ is the scale factor balancing the reward term and the regularization term.
7. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 5, characterized in that, in the deep reinforcement learning network, the following measures are adopted to improve the learning efficiency of the parameter θ:
(a) to improve learning stability, a target network is introduced and kept separate from the online network; the online network is updated at every iteration, while the target network is updated at intervals;
(b) to avoid getting trapped in local minima, an ε-greedy strategy is adopted as the action policy;
(c) to address the data-correlation problem, experience replay is used: transitions (s, a, r, s′) are stored in a buffer, from which a fixed number of samples are drawn at random during training to reduce the correlation between data.
8. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the deep reinforcement learning network adopts a dueling DQN structure, which can quickly identify the correct action during policy evaluation.
9. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the single-person pose estimation network combines the human-body binary mask obtained by the detection network with the cascaded pyramid network to perform human pose estimation, and the loss function of the single-person pose estimation network is as follows:
L = L_inf + k·L_mask
where L_inf is the error term between the predicted single-person pose and the ground-truth pose, L_mask is a regularization term representing the error between the predicted single-person pose and the human-body binary mask, and k is a scale factor balancing the two terms; L_mask = Σ_p L_p, where p indexes the human joint points;
L_p = (1 − m_l)·ŝ_l^p
where ŝ_l^p denotes the predicted value of joint p at position l in the activation map, l is the position of the maximum activation value in the activation map, and m_l is the human-body binary mask at position l (1 inside the human-body region, 0 in the background); a joint whose peak falls outside the human-body region is penalized, and otherwise the loss function is unaffected.
10. The multi-person pose estimation method based on mask-aware deep reinforcement learning according to claim 1, characterized in that the training stage of the multi-person pose estimation model is computed on a GPU.
CN201810968949.9A 2018-08-23 2018-08-23 Multi-person pose estimation method based on mask-aware deep reinforcement learning Active CN109190537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810968949.9A CN109190537B (en) 2018-08-23 2018-08-23 Multi-person pose estimation method based on mask-aware deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810968949.9A CN109190537B (en) 2018-08-23 2018-08-23 Multi-person pose estimation method based on mask-aware deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN109190537A CN109190537A (en) 2019-01-11
CN109190537B true CN109190537B (en) 2020-09-29

Family

ID=64919381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810968949.9A Active CN109190537B (en) 2018-08-23 2018-08-23 Multi-person pose estimation method based on mask-aware deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN109190537B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766887B (en) * 2019-01-16 2022-11-11 中国科学院光电技术研究所 Multi-target detection method based on cascaded hourglass neural network
CN109784296A (en) * 2019-01-27 2019-05-21 武汉星巡智能科技有限公司 Bus occupant quantity statistics method, device and computer readable storage medium
CN110008915B (en) * 2019-04-11 2023-02-03 电子科技大学 System and method for estimating dense human body posture based on mask-RCNN
CN110188219B (en) * 2019-05-16 2023-01-06 复旦大学 Depth-enhanced redundancy-removing hash method for image retrieval
CN110222636B (en) * 2019-05-31 2023-04-07 中国民航大学 Pedestrian attribute identification method based on background suppression
CN110210402B (en) * 2019-06-03 2021-11-19 北京卡路里信息技术有限公司 Feature extraction method and device, terminal equipment and storage medium
CN110197163B (en) * 2019-06-04 2021-02-12 中国矿业大学 Target tracking sample expansion method based on pedestrian search
CN110415332A (en) * 2019-06-21 2019-11-05 上海工程技术大学 Complex textile surface three dimensional reconstruction system and method under a kind of non-single visual angle
CN110335291A (en) * 2019-07-01 2019-10-15 腾讯科技(深圳)有限公司 Personage's method for tracing and terminal
CN112184802B (en) * 2019-07-05 2023-10-20 杭州海康威视数字技术股份有限公司 Calibration frame adjusting method, device and storage medium
CN112241976B (en) * 2019-07-19 2024-08-27 杭州海康威视数字技术股份有限公司 Model training method and device
CN110569719B (en) * 2019-07-30 2022-05-17 中国科学技术大学 Animal head posture estimation method and system
CN110866872B (en) * 2019-10-10 2022-07-29 北京邮电大学 Pavement crack image preprocessing intelligent selection method and device and electronic equipment
CN111415389B (en) * 2020-03-18 2023-08-29 清华大学 Label-free six-dimensional object posture prediction method and device based on reinforcement learning
CN111738091A (en) * 2020-05-27 2020-10-02 复旦大学 Posture estimation and human body analysis system based on multi-task deep learning
CN111695457B (en) * 2020-05-28 2023-05-09 浙江工商大学 Human body posture estimation method based on weak supervision mechanism
CN112052886B (en) * 2020-08-21 2022-06-03 暨南大学 Intelligent human body action posture estimation method and device based on convolutional neural network
CN112884780A (en) * 2021-02-06 2021-06-01 罗普特科技集团股份有限公司 Estimation method and system for human body posture
CN113012229A (en) * 2021-03-26 2021-06-22 北京华捷艾米科技有限公司 Method and device for positioning human body joint points
CN113361570B (en) * 2021-05-25 2022-11-01 东南大学 3D human body posture estimation method based on joint data enhancement and network training model
CN113436633B (en) * 2021-06-30 2024-03-12 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113537070B (en) * 2021-07-19 2022-11-22 中国第一汽车股份有限公司 Detection method, detection device, electronic equipment and storage medium
CN114143710B (en) * 2021-11-22 2022-10-04 武汉大学 Wireless positioning method and system based on reinforcement learning
CN116721471A (en) * 2023-08-10 2023-09-08 中国科学院合肥物质科学研究院 Multi-person three-dimensional attitude estimation method based on multi-view angles

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150544A (en) * 2011-08-30 2013-06-12 精工爱普生株式会社 Method and apparatus for object pose estimation
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method
CN106780569A (en) * 2016-11-18 2017-05-31 深圳市唯特视科技有限公司 A kind of human body attitude estimates behavior analysis method
CN106897697A (en) * 2017-02-24 2017-06-27 深圳市唯特视科技有限公司 A kind of personage and pose detection method based on visualization compiler
CN106951512A (en) * 2017-03-17 2017-07-14 深圳市唯特视科技有限公司 A kind of end-to-end session control method based on hybrid coding network
CN107392118A (en) * 2017-07-04 2017-11-24 竹间智能科技(上海)有限公司 The recognition methods of reinforcing face character and the system of generation network are resisted based on multitask
CN107944443A (en) * 2017-11-16 2018-04-20 深圳市唯特视科技有限公司 One kind carries out object consistency detection method based on end-to-end deep learning
CN108256489A (en) * 2018-01-24 2018-07-06 清华大学 Behavior prediction method and device based on deeply study
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11004358B2 (en) * 2015-04-22 2021-05-11 Jeffrey B. Matthews Visual and kinesthetic method and educational kit for solving algebraic linear equations involving an unknown variable
US10607342B2 (en) * 2016-09-30 2020-03-31 Siemens Healthcare GmbH Atlas-based contouring of organs at risk for radiation therapy

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150544A (en) * 2011-08-30 2013-06-12 精工爱普生株式会社 Method and apparatus for object pose estimation
CN106780569A (en) * 2016-11-18 2017-05-31 深圳市唯特视科技有限公司 A kind of human body attitude estimates behavior analysis method
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method
CN106897697A (en) * 2017-02-24 2017-06-27 深圳市唯特视科技有限公司 A kind of personage and pose detection method based on visualization compiler
CN106951512A (en) * 2017-03-17 2017-07-14 深圳市唯特视科技有限公司 A kind of end-to-end session control method based on hybrid coding network
CN107392118A (en) * 2017-07-04 2017-11-24 竹间智能科技(上海)有限公司 The recognition methods of reinforcing face character and the system of generation network are resisted based on multitask
CN107944443A (en) * 2017-11-16 2018-04-20 深圳市唯特视科技有限公司 One kind carries out object consistency detection method based on end-to-end deep learning
CN108256489A (en) * 2018-01-24 2018-07-06 清华大学 Behavior prediction method and device based on deeply study
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yan Tian et al., "Canonical Locality Preserving Latent Variable Model for Discriminative Pose Inference", Image and Vision Computing, Mar. 2013, pp. 223-230 *
Yuma Koizumi et al., "DNN-Based Source Enhancement Self-Optimized by Reinforcement Learning Using Sound Quality Measurements", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2017, pp. 81-85 *
Su Yanchao et al., "Human pose estimation based on part detectors in images and videos" (in Chinese), Journal of Electronics & Information Technology, vol. 33, no. 6, Jun. 2011, pp. 1413-1419 *
Lu Huchuan et al., "A survey of object tracking algorithms" (in Chinese), Pattern Recognition and Artificial Intelligence, vol. 31, no. 1, Jan. 2018, pp. 61-76 *

Also Published As

Publication number Publication date
CN109190537A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109190537B (en) Multi-person pose estimation method based on mask-aware deep reinforcement learning
US11468697B2 (en) Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN107529650B (en) Closed loop detection method and device and computer equipment
US10957309B2 (en) Neural network method and apparatus
CN106570464B (en) Face recognition method and device for rapidly processing face shielding
CN111310672A (en) Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN113570029A (en) Method for obtaining neural network model, image processing method and device
CN113705769A (en) Neural network training method and device
CN108182260B (en) Multivariate time sequence classification method based on semantic selection
CN104504366A (en) System and method for smiling face recognition based on optical flow features
CN110716792B (en) Target detector and construction method and application thereof
CN111598183A (en) Multi-feature fusion image description method
CN113592060A (en) Neural network optimization method and device
CN113407820B (en) Method for processing data by using model, related system and storage medium
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN113781517A (en) System and method for motion estimation
CN113326735A (en) Multi-mode small target detection method based on YOLOv5
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN109345559B (en) Moving target tracking method based on sample expansion and depth classification network
US20050114103A1 (en) System and method for sequential kernel density approximation through mode propagation
CN116486233A (en) Target detection method for multispectral double-flow network
CN112989952B (en) Crowd density estimation method and device based on mask guidance
KR20230141828A (en) Neural networks using adaptive gradient clipping
CN112053386B (en) Target tracking method based on depth convolution characteristic self-adaptive integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210929

Address after: 310000 Room 401, building 2, No.16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou Yunqi Smart Vision Technology Co., Ltd.

Address before: 310018, No. 18 Jiao Tong Street, Xiasha Higher Education Park, Hangzhou, Zhejiang

Patentee before: ZHEJIANG GONGSHANG University
