
GLPose: Global-Local Representation Learning for Human Pose Estimation

Published: 06 October 2022

Abstract

Multi-frame human pose estimation is at the core of many computer vision tasks. Although state-of-the-art approaches have demonstrated remarkable results for human pose estimation on static images, their performance inevitably falls short when they are applied to videos. A central issue is the visual degradation of video frames caused by rapid motion and pose occlusion in dynamic environments. This problem is, by nature, insurmountable for a single frame, so incorporating complementary visual cues from other video frames is an intuitive paradigm. Current state-of-the-art methods usually leverage information from adjacent frames, which unfortunately places excessive focus on only the temporally nearby frames. In this paper, we argue that combining global semantically similar information with local temporal visual context delivers more comprehensive and more robust representations for human pose estimation. To this end, we present an effective framework, the global-local enhanced pose estimation (GLPose) network. Our framework consists of two modules: a feature processing module that conditionally incorporates global semantic information and local visual context to generate a robust human representation, and a feature enhancement module that extracts complementary information from this aggregated representation to enhance keyframe features for precise estimation. We empirically find that the proposed GLPose outperforms existing methods by a large margin and achieves new state-of-the-art results on large benchmark datasets.
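
The global-local idea described in the abstract can be illustrated with a toy sketch. This is not the GLPose architecture: the abstract does not specify how frames are selected or fused, so everything below is an assumption made for illustration. Frames are represented as plain feature vectors, "local" frames are those within a fixed temporal window of the keyframe, "global" frames are the most cosine-similar frames outside that window, and the keyframe is enhanced by adding a softmax-weighted average of the selected frames. The function names (`select_support_frames`, `aggregate`, `enhance_keyframe`) and the parameters `window` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_support_frames(features, key, window=1, top_k=2):
    """Return (local, global_) frame indices for the keyframe at index `key`.

    local:   temporally adjacent frames within `window` steps of the keyframe.
    global_: the `top_k` frames outside the local window that are most
             similar to the keyframe (assumed similarity: cosine).
    """
    n = len(features)
    local = [i for i in range(max(0, key - window), min(n, key + window + 1))
             if i != key]
    excluded = set(local) | {key}
    candidates = [i for i in range(n) if i not in excluded]
    candidates.sort(key=lambda i: cosine(features[key], features[i]),
                    reverse=True)
    return local, candidates[:top_k]


def aggregate(features, key, support):
    """Softmax-weighted average of support features (weights from cosine similarity)."""
    if not support:
        return list(features[key])
    sims = [cosine(features[key], features[i]) for i in support]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]  # subtract max for numerical stability
    z = sum(exps)
    dim = len(features[key])
    return [sum(w / z * features[i][d] for w, i in zip(exps, support))
            for d in range(dim)]


def enhance_keyframe(features, key, window=1, top_k=2):
    """Add the aggregated global-local context to the keyframe feature (residual sum)."""
    local, global_ = select_support_frames(features, key, window, top_k)
    context = aggregate(features, key, local + global_)
    return [f + c for f, c in zip(features[key], context)]


# Toy sequence of six 2-D frame features; frame 2 is the keyframe.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05], [0.0, 1.0]]
local_idx, global_idx = select_support_frames(frames, key=2, window=1, top_k=1)
enhanced = enhance_keyframe(frames, key=2, window=1, top_k=1)
```

In this toy sequence, frame 5 is far from the keyframe in time but identical to it in feature space, so it is picked up by the global selection while a purely adjacent-frame method would miss it.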




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2s
    June 2022
    383 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3561949
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 October 2022
    Online AM: 12 March 2022
    Accepted: 15 February 2022
    Revised: 21 January 2022
    Received: 11 November 2021
    Published in TOMM Volume 18, Issue 2s


    Author Tags

    1. Human pose estimation
    2. feature aggregation
    3. pose estimation
    4. global-local representation

    Qualifiers

    • Research-article
    • Refereed


    Article Metrics

    • Downloads (Last 12 months): 119
    • Downloads (Last 6 weeks): 11
    Reflects downloads up to 02 Mar 2025


    Cited By

    • (2024) Category-Level Pose Estimation and Iterative Refinement for Monocular RGB-D Image. ACM Transactions on Multimedia Computing, Communications, and Applications 20:12 (1-20). DOI: 10.1145/3695877. Online publication date: 11-Sep-2024
    • (2024) Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-18). DOI: 10.1145/3639707. Online publication date: 8-Mar-2024
    • (2023) Human Pose Estimation Using Deep Learning: A Systematic Literature Review. Machine Learning and Knowledge Extraction 5:4 (1612-1659). DOI: 10.3390/make5040081. Online publication date: 13-Nov-2023
    • (2023) Sparsity-guided Discriminative Feature Encoding for Robust Keypoint Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 20:3 (1-22). DOI: 10.1145/3628432. Online publication date: 17-Oct-2023
    • (2023) Spatiotemporal Learning Transformer for Video-Based Human Pose Estimation. IEEE Transactions on Circuits and Systems for Video Technology 33:9 (4564-4576). DOI: 10.1109/TCSVT.2023.3269666. Online publication date: 1-Sep-2023
    • (2023) JointMETRO: a 3D reconstruction model for human figures in works of art based on transformer. Neural Computing and Applications 36:20 (11711-11725). DOI: 10.1007/s00521-023-08844-y. Online publication date: 21-Jul-2023
