Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation

Abstract

This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot-product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We also propose an efficient recurrent neural network for performing inference with the learned image-embedding. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating the network has learned a high-level embedding of body-orientation and pose-configuration.
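The scoring scheme described in the abstract can be illustrated with a minimal sketch. It shows a dot-product score between an image embedding and a pose embedding, trained with a hinge-style margin cost. Everything concrete here is an assumption for illustration and is not taken from the paper: the single-layer ReLU embeddings stand in for the sub-networks, the image feature vector stands in for the CNN output, the margin-rescaled hinge is the standard structured-SVM form rather than the authors' exact cost, and the names `embed`, `score`, `pose_error`, `margin_loss`, and all dimensions are hypothetical.

```python
import numpy as np

def embed(v, W, b):
    """Hypothetical one-layer ReLU embedding standing in for a sub-network."""
    return np.maximum(0.0, W @ v + b)

def score(img_feat, pose, params):
    """Score of an image-pose pair: dot product of the two embeddings."""
    f_img = embed(img_feat, params["W_img"], params["b_img"])
    f_pose = embed(pose, params["W_pose"], params["b_pose"])
    return float(f_img @ f_pose)

def pose_error(y, y_other):
    """Mean per-joint Euclidean distance; poses are flat vectors of length 3*J."""
    diff = (y - y_other).reshape(-1, 3)
    return float(np.linalg.norm(diff, axis=1).mean())

def margin_loss(img_feat, y_true, y_other, params):
    """Margin-rescaled hinge (assumed form): the true pose should outscore
    any other pose by at least the pose error between them."""
    violation = (score(img_feat, y_other, params)
                 + pose_error(y_true, y_other)
                 - score(img_feat, y_true, params))
    return max(0.0, violation)

# Toy dimensions (assumptions): 8-d image feature, 5 joints, 4-d joint embedding.
rng = np.random.default_rng(0)
feat_dim, num_joints, embed_dim = 8, 5, 4
params = {
    "W_img": rng.normal(size=(embed_dim, feat_dim)),
    "b_img": np.zeros(embed_dim),
    "W_pose": rng.normal(size=(embed_dim, 3 * num_joints)),
    "b_pose": np.zeros(embed_dim),
}
x = rng.normal(size=feat_dim)            # stand-in for CNN image features
y = rng.normal(size=3 * num_joints)      # ground-truth pose
y_neg = y + rng.normal(scale=0.5, size=3 * num_joints)  # perturbed pose

loss = margin_loss(x, y, y_neg, params)
```

In this sketch the loss is zero only when the matching pose already outscores the alternative by the required margin. In a full structured-learning setup the alternative pose would typically be the most-violating one found during training rather than a random perturbation.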

Notes

Note that \({\hat{y}}\) depends on the input (x, y) and network parameters \(\theta \). To reduce clutter, we write \({\hat{y}}\) instead of \({\hat{y}}(x,y,\theta )\) when no confusion arises.
The action “Direction” is not included due to video corruption.
For better visualization, we only use the images from a single subject.

Acknowledgments

This work was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 123212), and by a Strategic Research Grant from City University of Hong Kong (Project Nos. 7004417 and 7004682). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information

Authors and Affiliations

Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
Sijin Li, Weichen Zhang & Antoni B. Chan

Corresponding author

Correspondence to Sijin Li.

Additional information

Communicated by Deva Ramanan.

About this article

Cite this article

Li, S., Zhang, W. & Chan, A.B. Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation. Int J Comput Vis 122, 149–168 (2017). https://doi.org/10.1007/s11263-016-0962-x

Received: 28 February 2016
Accepted: 20 September 2016
Published: 01 October 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s11263-016-0962-x
