
research-article

Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning

Authors:

Liang Lin

MM '16: Proceedings of the 24th ACM international conference on Multimedia

Pages 1227 - 1236

https://doi.org/10.1145/2964284.2964322

Published: 01 October 2016

Abstract

Human pose estimation (i.e., locating the body parts or joints of a person) is a fundamental problem in human-computer interaction and multimedia applications. The development of depth sensors has enabled significant progress, making human pose prediction from still depth images accessible \cite{rf12pami}. However, most existing approaches to this problem involve several components or models that are designed and optimized independently, leading to suboptimal performance. In this paper, we propose a novel inference-embedded multi-task learning framework for predicting human pose from still depth images, implemented as a deep neural network architecture. Specifically, we handle two cascaded tasks: i) generating the heat (confidence) maps of body parts via a fully convolutional network (FCN); ii) seeking the optimal configuration of body parts from the detected body part proposals via an inference built-in MatchNet \cite{mn15cvpr}, which measures the appearance and geometric kinematic compatibility of body parts and embodies dynamic programming inference as an extra network layer. These two tasks are jointly optimized. Our extensive experiments show that the proposed deep model significantly improves the accuracy of human pose estimation over several other state-of-the-art methods and SDKs. We also release a large-scale dataset for comparison, which includes 100K depth images under challenging scenarios.
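The second task in the abstract, choosing one proposal per body part so that the whole configuration is consistent, can be illustrated with a small max-sum dynamic program over a tree of parts. The sketch below is not the authors' implementation: the paper embeds the inference as a network layer and uses a MatchNet-based compatibility, whereas here `infer_pose`, the toy proposals, and the hand-written distance penalty in `pairwise` are all hypothetical stand-ins chosen only to show the inference step.

```python
import math
from typing import Callable, List, Sequence, Tuple


def infer_pose(
    parents: Sequence[int],
    unary: Sequence[Sequence[float]],
    pairwise: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """Max-sum dynamic programming over a tree of body parts.

    parents[i] is the parent part of part i (parents[0] == -1 for the root),
    and every parent index must be smaller than its child's index.
    unary[i][k] scores proposal k of part i (e.g. a heat-map confidence).
    pairwise(i, k, j) scores proposal k of part i against proposal j of its parent.
    """
    n = len(parents)
    total = [list(scores) for scores in unary]  # best subtree score per proposal
    back: List[List[int]] = [[] for _ in range(n)]

    # Leaves-to-root pass: children have larger indices, so iterate backwards.
    for child in range(n - 1, 0, -1):
        parent = parents[child]
        for pj in range(len(unary[parent])):
            scored = [
                total[child][k] + pairwise(child, k, pj)
                for k in range(len(unary[child]))
            ]
            best_k = max(range(len(scored)), key=scored.__getitem__)
            back[child].append(best_k)
            total[parent][pj] += scored[best_k]

    # Root-to-leaves pass: read the stored argmax choices back out.
    choice = [0] * n
    choice[0] = max(range(len(total[0])), key=total[0].__getitem__)
    for child in range(1, n):
        choice[child] = back[child][choice[parents[child]]]
    return choice, total[0][choice[0]]


# Hypothetical toy input: a torso -> elbow -> hand chain with 2D proposals.
parents = [-1, 0, 1]
positions = [
    [(0.0, 0.0)],
    [(1.0, 0.0), (5.0, 0.0)],
    [(2.0, 0.0), (1.0, 3.0), (6.0, 0.0)],
]
unary = [[1.0], [0.6, 0.9], [0.5, 0.4, 0.8]]
limb_length = [0.0, 1.0, 1.0]  # assumed expected distance to the parent part


def pairwise(part: int, k: int, j: int) -> float:
    """Hand-written stand-in for a learned compatibility score."""
    d = math.dist(positions[part][k], positions[parents[part]][j])
    return -abs(d - limb_length[part])


choice, score = infer_pose(parents, unary, pairwise)
print(choice, round(score, 3))  # prints [0, 0, 0] 2.1
```

In this toy input, picking the most confident proposal for each part independently would select the second elbow and third hand proposals, which lie far from the torso. The joint search instead returns the kinematically consistent set, which is the motivation for coupling part detection with configuration inference.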

References

[1] X. Chen and A. L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems (NIPS), pages 1736–1744, 2014.

[2] X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(9):1627–1645, 2010.

[4] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient matching of pictorial structures. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 66–73, 2000.

[5] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of the International Conference on Machine Learning (ICML), pages 304–311, 2008.

[6] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1):67–92, 1973.

[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[8] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In Proc. of IEEE International Conference on Computer Vision (ICCV), pages 415–422, 2011.

[9] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3D structure with a statistical image-based shape model. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2003.

[10] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3286, 2015.

[11] C. Hong, J. Yu, D. Tao, and M. Wang. Multimodal deep autoencoder for human pose recovery. IEEE Transactions on Image Processing (TIP), 2016.

[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[13] S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1465–1472, 2011.

[14] H. Y. Jung, S. Lee, Y. S. Heo, and I. D. Yun. Random tree walk toward instantaneous 3D human pose estimation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2467–2474, 2015.

[15] M. Kiefel and P. V. Gehler. Human pose estimation with fields of parts. In Proc. of European Conference on Computer Vision (ECCV), 2014.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[17] X. Liang, L. Lin, and L. Cao. Learning latent spatio-temporal compositional model for human action recognition. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 263–272, 2013.

[18] X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, L. Lin, and S. Yan. Deep human parsing with active template regression. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(12):2402–2414, 2015.

[19] X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, and S. Yan. Towards computational baby learning: A weakly-supervised approach for object detection. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2015.

[20] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[21] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan. Human parsing with contextualized convolutional neural network. In Proc. of IEEE International Conference on Computer Vision (ICCV), 2015.

[22] Z. Liang, X. Wang, R. Huang, and L. Lin. An expressive deep model for parsing human action from a single image. In Proc. of IEEE International Conference on Multimedia and Expo (ICME), 2014.

[23] L. Lin, G. Wang, R. Zhang, R. Zhang, X. Liang, and W. Zuo. Deep structured scene parsing by learning with image descriptions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[24] L. Lin, G. Wang, W. Zuo, X. Feng, and L. Zhang. Cross-domain visual matching via generalized similarity measure and feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.

[25] L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, and L. Zhang. A deep structured model with radius-margin bound for 3D human activity recognition. International Journal of Computer Vision (IJCV), 118(2):256–273, 2016.

[26] L. Lin, X. Wang, W. Yang, and J. H. Lai. Discriminatively trained And-Or graph models for object shape detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(5):959–972, 2015.

[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[28] G. Mori and J. Malik. Estimating human body configurations using shape context matching, 2002.

[29] J. O'Rourke and N. I. Badler. Model-based image analysis of human motion using constraint propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1980.

[30] Z. Peng, R. Zhang, X. Liang, X. Liu, and L. Lin. Geometric scene parsing with hierarchical LSTM. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 2016.

[31] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1297–1304, 2011.

[32] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):2821–2840, 2013.

[33] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Commun. ACM, 56(1):116–124, 2013.

[34] M. Sun, P. Kohli, and J. Shotton. Conditional regression forests for human pose estimation. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3394–3401, 2012.

[35] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3D body poses from motion compensated sequences. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[36] Y. Tian, C. L. Zitnick, and S. G. Narasimhan. Exploring the spatial hierarchy of mixture models for human pose estimation. In Proc. of European Conference on Computer Vision (ECCV), 2012.

[37] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015.

[38] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems (NIPS), pages 1799–1807, 2014.

[39] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1653–1660, 2014.

[40] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of 3D human poses from a single image. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2369–2376, 2014.

[41] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[42] G. Wang, L. Lin, S. Ding, Y. Li, and Q. Wang. DARI: Distance metric and representation integration for person verification. In Proc. of AAAI Conference on Artificial Intelligence (AAAI), 2016.

[43] K. Wang, L. Lin, W. Zuo, S. Gu, and L. Zhang. Dictionary pair classifier driven convolutional neural networks for object detection. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[44] K. Wang, X. Wang, L. Lin, M. Wang, and W. Zuo. 3D human activity recognition with reconfigurable convolutional neural networks. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 97–106, 2014.

[45] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2016.

[46] M. Wang, X. Liu, and X. Wu. Visual classification by l1-hypergraph modeling. IEEE Transactions on Knowledge and Data Engineering, 27(9):2564–2574, 2015.

[47] X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, and S. Yan. Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE Transactions on Multimedia (TMM), 18(6):1175–1186, 2016.

[48] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):2878–2890, 2013.

[49] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems (NIPS), 2002.

[50] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing (TIP), 24(12):4766–4779, 2015.

Index Terms

Human Pose Estimation from Depth Images via Inference Embedded Multi-task Learning
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object detection
      2. Computer vision tasks
        Scene understanding
        Vision for robotics

Information & Contributors

Information

Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia

October 2016

1542 pages

ISBN: 9781450336031

DOI: 10.1145/2964284

General Chairs: Alan Hanjalic (Delft University of Technology), Cees Snoek (Qualcomm Research Netherlands / University of Amsterdam), Marcel Worring (University of Amsterdam)

Moderator: Dick Bulterman (CWI / VU University Amsterdam)

Program Chairs: Benoit Huet (EURECOM), Aisling Kelliher (Virginia Tech), Yiannis Kompatsiaris (CERTH-ITI), Jin Li (Microsoft)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Guangdong Natural Science Foundation
Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase)
the Fundamental Research Funds for the Central Universities
State Key Development Program

Conference

MM '16

Sponsor:

SIGMM

MM '16: ACM Multimedia Conference

October 15 - 19, 2016

Amsterdam, The Netherlands

Acceptance Rates

MM '16 Paper Acceptance Rate 52 of 237 submissions, 22%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics

Total Citations: 31

Total Downloads: 418

Downloads (Last 12 months): 22

Downloads (Last 6 weeks): 2

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Bhuyan S, Kar A, Sen D, Deb S (2024). RGB-D Fusion Through Zero-Shot Fuzzy Membership Learning for Salient Object Detection. IEEE Transactions on Artificial Intelligence 5:7 (3638-3652). Online publication date: Jul-2024. https://doi.org/10.1109/TAI.2024.3376640

Wu X, Zhang H, Kong C, Wang Y, Ju Y, Zhao C (2024). LiDAR-Based 3-D Human Pose Estimation and Action Recognition for Medical Scenes. IEEE Sensors Journal 24:9 (15531-15539). Online publication date: 1-May-2024. https://doi.org/10.1109/JSEN.2024.3373192

Li G, Zhao K, Zhang S, Lyu X, Dusmanu M, Zhang Y, Pollefeys M, Tang S (2024). EgoGen: An Egocentric Synthetic Data Generator. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (14497-14509). Online publication date: 16-Jun-2024. https://doi.org/10.1109/CVPR52733.2024.01374

Ballester I, Peterka O, Kampel M (2024). SPiKE: 3D Human Pose from Point Cloud Sequences. Pattern Recognition (470-486). Online publication date: 3-Dec-2024. https://doi.org/10.1007/978-3-031-78456-9_30

Xing H, Yang J, Xiao Y (2023). Learning dynamic relationship between joints for 3D hand pose estimation from single depth map. Journal of Visual Communication and Image Representation 92 (103803). Online publication date: Apr-2023. https://doi.org/10.1016/j.jvcir.2023.103803

Xing H, Yang J (2022). WPL-Based Constraint for 3D Human Pose Estimation from a Single Depth Image. Sensors 22:23 (9040). Online publication date: 22-Nov-2022. https://doi.org/10.3390/s22239040

Guo Y, Li Z, Li Z, Du X, Quan S, Xu Y (2022). PoP-Net: Pose over Parts Network for Multi-Person 3D Pose Estimation from a Depth Image. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (3917-3926). Online publication date: Jan-2022. https://doi.org/10.1109/WACV51458.2022.00397

Zhang B, Xiao Y, Xiong F, Wu C, Cao Z, Liu P, Zhou J (2022). 3D human pose estimation with cross-modality training and multi-scale local refinement. Applied Soft Computing 122 (108950). Online publication date: Jun-2022. https://doi.org/10.1016/j.asoc.2022.108950

Li L, Prabhu S, Xie Z, Wang H, Lu L, Xu X (2021). Lifting Posture Prediction With Generative Models for Improving Occupational Safety. IEEE Transactions on Human-Machine Systems 51:5 (494-503). Online publication date: Oct-2021. https://doi.org/10.1109/THMS.2021.3102511

Yu L, Shen L, Yang H, Jiang X, Yan B (2021). A Distortion-Aware Multi-Task Learning Framework for Fractional Interpolation in Video Coding. IEEE Transactions on Circuits and Systems for Video Technology 31:7 (2824-2836). Online publication date: Jul-2021. https://doi.org/10.1109/TCSVT.2020.3028330
