research-article

Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning

Published: 01 October 2016

Abstract

Human pose estimation (i.e., locating the body parts / joints of a person) is a fundamental problem in human-computer interaction and multimedia applications. The development of depth sensors has driven significant progress, enabling accessible human pose prediction from still depth images~\cite{rf12pami}. However, most existing approaches to this problem involve several components/models that are designed and optimized independently, leading to suboptimal performance. In this paper, we propose a novel inference-embedded multi-task learning framework for predicting human pose from still depth images, implemented as a deep neural network architecture. Specifically, we handle two cascaded tasks: i) generating the heat (confidence) maps of body parts via a fully convolutional network (FCN); ii) seeking the optimal configuration of body parts from the detected body part proposals via an inference built-in MatchNet~\cite{mn15cvpr}, which measures the appearance and geometric kinematic compatibility of body parts and embodies dynamic programming inference as an extra network layer. The two tasks are jointly optimized. Our extensive experiments show that the proposed deep model significantly improves the accuracy of human pose estimation over several other state-of-the-art methods and SDKs. We also release a large-scale dataset for comparison, which includes 100K depth images captured in challenging scenarios.
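The second task in the abstract, choosing one proposal per body part so that the total confidence and kinematic compatibility is maximal, can be illustrated with a small max-sum dynamic program over a tree of parts. The sketch below is not the authors' network layer: the function names (`pairwise_score`, `best_configuration`), the quadratic offset penalty, and the toy coordinates are assumptions made for illustration, and the unary scores stand in for heat-map confidences that the paper obtains from the FCN and MatchNet.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pairwise_score(child: Point, parent: Point, expected: Point,
                   weight: float = 1.0) -> float:
    """Hypothetical geometric compatibility: penalise the squared deviation
    of the child-parent offset from an expected offset."""
    dx = (child[0] - parent[0]) - expected[0]
    dy = (child[1] - parent[1]) - expected[1]
    return -weight * (dx * dx + dy * dy)


def best_configuration(parents: List[int],
                       candidates: List[List[Point]],
                       unary: List[List[float]],
                       offsets: List[Point]) -> Tuple[List[int], float]:
    """Max-sum dynamic programming on a tree of parts.

    parents[i] is the parent of part i (-1 for the root, which must be part 0);
    parts are ordered so that parents[i] < i. Returns the chosen candidate
    index for each part and the total score of that configuration.
    """
    n = len(parents)
    # score[i][k]: best score of the subtree rooted at part i using candidate k
    score = [list(u) for u in unary]
    # argbest[i][kp]: best candidate of part i given its parent's candidate kp
    argbest: List[List[int]] = [[] for _ in range(n)]

    # Upward pass: children have larger indices, so visit them first.
    for i in range(n - 1, 0, -1):
        p = parents[i]
        for kp, parent_pos in enumerate(candidates[p]):
            best_k, best_val = 0, float("-inf")
            for k, child_pos in enumerate(candidates[i]):
                val = score[i][k] + pairwise_score(child_pos, parent_pos, offsets[i])
                if val > best_val:
                    best_k, best_val = k, val
            argbest[i].append(best_k)
            score[p][kp] += best_val

    # Downward pass: fix the root, then read off each child's best candidate.
    choice = [0] * n
    choice[0] = max(range(len(score[0])), key=score[0].__getitem__)
    for i in range(1, n):
        choice[i] = argbest[i][choice[parents[i]]]
    return choice, score[0][choice[0]]


# Toy example (all numbers invented): head -> torso -> hand.
parents = [-1, 0, 1]
candidates = [
    [(5.0, 0.0)],              # head proposals
    [(5.0, 4.0), (9.0, 4.0)],  # torso proposals
    [(8.0, 5.0), (2.0, 9.0)],  # hand proposals
]
unary = [[0.9], [0.6, 0.7], [0.5, 0.8]]           # stand-ins for heat-map confidences
offsets = [(0.0, 0.0), (0.0, 4.0), (3.0, 1.0)]    # expected child-minus-parent offsets

choice, total = best_configuration(parents, candidates, unary, offsets)
print(choice, total)  # choice is [0, 0, 0]
```

In this toy case, picking the highest-confidence proposal for each part independently would select the second torso and second hand candidates, whereas the joint inference prefers the geometrically consistent ones. The cost of the upward pass grows with the square of the number of proposals per part, which is why the number of proposals is usually kept small.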

References

[1]
X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems (NIPS), pages 1736--1744, 2014.
[2]
X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3]
P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(9):1627--1645, 2010.
[4]
P. F. Felzenszwalb and D. P. Huttenlocher. Efficient matching of pictorial structures. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 66--73, 2000.
[5]
T. Finley and T. Joachims. Training structural svms when exact inference is intractable. In Proceedings of the International Conference on Machine Learning (ICML), pages 304--311, 2008.
[6]
M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1):67--92, 1973.
[7]
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[8]
R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In Proc. of IEEE International Conference on Computer Vision (ICCV), pages 415--422, 2011.
[9]
K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3d structure with a statistical image-based shape model. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2003.
[10]
X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279--3286, 2015.
[11]
C. Hong, J. Yu, D. Tao, and M. Wang. Multimodal deep autoencoder for human pose recovery. IEEE Transactions on Image Processing (TIP), 2016.
[12]
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[13]
S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1465--1472, 2011.
[14]
H. Y. Jung, S. Lee, Y. S. Heo, and I. D. Yun. Random tree walk toward instantaneous 3d human pose estimation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2467--2474, 2015.
[15]
M. Kiefel and P. V. Gehler. Human pose estimation with fields of parts. In Proc. of European Conference on Computer Vision (ECCV), 2014.
[16]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[17]
X. Liang, L. Lin, and L. Cao. Learning latent spatio-temporal compositional model for human action recognition. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 263--272, 2013.
[18]
X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, L. Lin, and S. Yan. Deep human parsing with active template regression. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(12):2402--2414, 2015.
[19]
X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, and S. Yan. Towards computational baby learning: A weakly-supervised approach for object detection. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2015.
[20]
X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[21]
X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan. Human parsing with contextualized convolutional neural network. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2015.
[22]
Z. Liang, X. Wang, R. Huang, and L. Lin. An expressive deep model for parsing human action from a single image. In Proc. of IEEE International Conference on Multimedia and Expo (ICME), 2014.
[23]
L. Lin, G. Wang, R. Zhang, R. Zhang, X. Liang, and W. Zuo. Deep structured scene parsing by learning with image descriptions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[24]
L. Lin, G. Wang, W. Zuo, X. Feng, and L. Zhang. Cross-domain visual matching via generalized similarity measure and feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
[25]
L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, and L. Zhang. A deep structured model with radius-margin bound for 3d human activity recognition. International Journal of Computer Vision (IJCV), 118(2):256--273, 2016.
[26]
L. Lin, X. Wang, W. Yang, and J. H. Lai. Discriminatively trained and-or graph models for object shape detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(5):959--972, 2015.
[27]
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[28]
G. Mori and J. Malik. Estimating human body configurations using shape context matching, 2002.
[29]
J. O'Rourke and N. I. Badler. Model-based image analysis of human motion using constraint propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1980.
[30]
Z. Peng, R. Zhang, X. Liang, X. Liu, and L. Lin. Geometric scene parsing with hierarchical lstm. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 2016.
[31]
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1297--1304, 2011.
[32]
J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):2821--2840, 2013.
[33]
J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Commun. ACM, 56(1):116--124, 2013.
[34]
M. Sun, P. Kohli, and J. Shotton. Conditional regression forests for human pose estimation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3394--3401, 2012.
[35]
B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3d body poses from motion compensated sequences. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[36]
Y. Tian, C. L. Zitnick, and S. G. Narasimhan. Exploring the spatial hierarchy of mixture models for human pose estimation. In Proc. of European Conference on Computer Vision (ECCV), 2012.
[37]
J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648--656, 2015.
[38]
J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems (NIPS), pages 1799--1807, 2014.
[39]
A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1653--1660, 2014.
[40]
C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of 3d human poses from a single image. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2369--2376, 2014.
[41]
F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[42]
G. Wang, L. Lin, S. Ding, Y. Li, and Q. Wang. Dari: Distance metric and representation integration for person verification. In Proc. of AAAI Conference on Artificial Intelligence (AAAI), 2016.
[43]
K. Wang, L. Lin, W. Zuo, S. Gu, and L. Zhang. Dictionary pair classifier driven convolutional neural networks for object detection. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[44]
K. Wang, X. Wang, L. Lin, M. Wang, and W. Zuo. 3d human activity recognition with reconfigurable convolutional neural networks. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 97--106, 2014.
[45]
K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2016.
[46]
M. Wang, X. Liu, and X. Wu. Visual classification by l1-hypergraph modeling. IEEE Transactions on Knowledge and Data Engineering, 27(9):2564--2574, 2015.
[47]
X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, and S. Yan. Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE Transactions on Multimedia (T-MM), 18(6):1175--1186, 2016.
[48]
Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):2878--2890, 2013.
[49]
A. L. Yuille and A. Rangarajan. The concave-convex procedure (cccp). In Advances in Neural Information Processing Systems (NIPS), 2002.
[50]
R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing (TIP), 24(12):4766--4779, 2015.

Information & Contributors

Information

Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016
1542 pages
ISBN:9781450336031
DOI:10.1145/2964284
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. human pose estimation
  3. multi-task learning

Qualifiers

  • Research-article

Funding Sources

  • Guangdong Natural Science Foundation
  • Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase)
  • the Fundamental Research Funds for the Central Universities
  • State Key Development Program

Conference

MM '16
Sponsor:
MM '16: ACM Multimedia Conference
October 15 - 19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 Paper Acceptance Rate 52 of 237 submissions, 22%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 21
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 11 Jan 2025

Citations

Cited By

  • (2024) RGB-D Fusion Through Zero-Shot Fuzzy Membership Learning for Salient Object Detection. IEEE Transactions on Artificial Intelligence, 5:7 (3638-3652). DOI: 10.1109/TAI.2024.3376640. Online publication date: Jul-2024
  • (2024) LiDAR-Based 3-D Human Pose Estimation and Action Recognition for Medical Scenes. IEEE Sensors Journal, 24:9 (15531-15539). DOI: 10.1109/JSEN.2024.3373192. Online publication date: 1-May-2024
  • (2024) EgoGen: An Egocentric Synthetic Data Generator. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (14497-14509). DOI: 10.1109/CVPR52733.2024.01374. Online publication date: 16-Jun-2024
  • (2024) SPiKE: 3D Human Pose from Point Cloud Sequences. Pattern Recognition (470-486). DOI: 10.1007/978-3-031-78456-9_30. Online publication date: 3-Dec-2024
  • (2023) Learning dynamic relationship between joints for 3D hand pose estimation from single depth map. Journal of Visual Communication and Image Representation, 92 (103803). DOI: 10.1016/j.jvcir.2023.103803. Online publication date: Apr-2023
  • (2022) WPL-Based Constraint for 3D Human Pose Estimation from a Single Depth Image. Sensors, 22:23 (9040). DOI: 10.3390/s22239040. Online publication date: 22-Nov-2022
  • (2022) PoP-Net: Pose over Parts Network for Multi-Person 3D Pose Estimation from a Depth Image. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (3917-3926). DOI: 10.1109/WACV51458.2022.00397. Online publication date: Jan-2022
  • (2022) 3D human pose estimation with cross-modality training and multi-scale local refinement. Applied Soft Computing, 122 (108950). DOI: 10.1016/j.asoc.2022.108950. Online publication date: Jun-2022
  • (2021) Lifting Posture Prediction With Generative Models for Improving Occupational Safety. IEEE Transactions on Human-Machine Systems, 51:5 (494-503). DOI: 10.1109/THMS.2021.3102511. Online publication date: Oct-2021
  • (2021) A Distortion-Aware Multi-Task Learning Framework for Fractional Interpolation in Video Coding. IEEE Transactions on Circuits and Systems for Video Technology, 31:7 (2824-2836). DOI: 10.1109/TCSVT.2020.3028330. Online publication date: Jul-2021
