
UformPose: A U-Shaped Hierarchical Multi-Scale Keypoint-Aware Framework for Human Pose Estimation

Published: 01 April 2023

Abstract

Human pose estimation is a fundamental yet challenging task in computer vision. Difficult scenarios such as invisible keypoints, occlusions, and small-scale persons are still not well handled. In this paper, we present a novel pose estimation framework named UformPose that aims to alleviate these issues. UformPose has two core designs: a Shared Feature Pyramid Stem (SFPS) and a U-shaped hierarchical Multi-scale Keypoint-aware Attention Module (U-MKAM). SFPS is a feature pyramid stem with a sharing mechanism that learns stronger low-level features at the initial stage; the sharing mechanism also facilitates cross-resolution commonality learning. U-MKAM generates high-quality, high-resolution representations by integrating all levels of the backbone's feature representations layer by layer. More importantly, we exploit the flexibility of attention operations for keypoint-aware modeling, which explicitly captures and trades off the dependencies between keypoints. We empirically demonstrate the effectiveness of our framework through competitive pose estimation results on the COCO dataset. Extensive experiments and visual analysis on CrowdPose demonstrate the robustness of our model in crowded scenes.
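
As a rough, non-authoritative illustration of the two mechanisms the abstract names, the Python (PyTorch) sketch below fuses multi-scale backbone features layer by layer into one high-resolution map (the U-shaped decoding) and then lets learned per-keypoint queries attend over that map, reading the attention weights out as heatmaps. Everything here is an assumption for illustration only: the class name, channel widths, the 17-keypoint (COCO-style) query count, and the use of attention maps as heatmaps are our guesses, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UShapedKeypointDecoder(nn.Module):
    """Hypothetical sketch: U-shaped multi-scale fusion + keypoint-query attention."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), embed_dim=256,
                 num_keypoints=17, num_heads=8):
        super().__init__()
        # 1x1 convolutions project every pyramid level to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels)
        # 3x3 convolutions smooth each fused level on the way back up.
        self.smooth = nn.ModuleList(
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1)
            for _ in in_channels[:-1])
        # Learned keypoint queries; their attention over the fused
        # high-resolution map is read out as per-keypoint heatmaps.
        self.keypoint_queries = nn.Parameter(torch.randn(num_keypoints, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats: backbone outputs, highest resolution first.
        laterals = [proj(f) for proj, f in zip(self.lateral, feats)]
        # U-shaped, layer-by-layer fusion: upsample the coarser level and
        # merge it into the next finer one, coarsest to finest.
        x = laterals[-1]
        for i in range(len(laterals) - 2, -1, -1):
            x = F.interpolate(x, size=laterals[i].shape[-2:],
                              mode='bilinear', align_corners=False)
            x = self.smooth[i](laterals[i] + x)
        # Keypoint-aware attention: each query gathers evidence from every
        # spatial position of the high-resolution map.
        b, _, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        queries = self.keypoint_queries.unsqueeze(0).expand(b, -1, -1)
        _, attn_weights = self.attn(queries, tokens, tokens)   # (B, K, H*W)
        return attn_weights.reshape(b, -1, h, w)               # (B, K, H, W)

# Usage with ResNet-like feature shapes for a 256x192 input crop:
feats = [torch.randn(2, c, 64 // 2 ** i, 48 // 2 ** i)
         for i, c in enumerate((256, 512, 1024, 2048))]
heatmaps = UShapedKeypointDecoder()(feats)
print(heatmaps.shape)  # torch.Size([2, 17, 64, 48])

Because every query attends over the full map and shares the attention block with the other queries, inter-keypoint dependencies are modeled jointly rather than per-heatmap, which is the property the abstract attributes to its keypoint-aware attention.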

Cited By

  • Diffusion-Based Hypotheses Generation and Joint-Level Hypotheses Aggregation for 3D Human Pose Estimation, IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 11, pt. 1, pp. 10678–10691, Jun. 2024. 10.1109/TCSVT.2024.3415348
  • U-COPE: Taking a Further Step to Universal 9D Category-Level Object Pose Estimation, in Computer Vision – ECCV 2024, 2024, pp. 254–270. 10.1007/978-3-031-72684-2_15
  • Hierarchical Attention Network for Open-Set Fine-Grained Image Recognition, IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, pp. 3891–3904, Oct. 2023. 10.1109/TCSVT.2023.3325001
  • SmokePose: End-to-End Smoke Keypoint Detection, IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 10, pp. 5778–5789, Oct. 2023. 10.1109/TCSVT.2023.3258527

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 33, Issue 4
April 2023
514 pages

Publisher

IEEE Press
