
UformPose: A U-Shaped Hierarchical Multi-Scale Keypoint-Aware Framework for Human Pose Estimation

Published: 01 April 2023

Abstract

Human pose estimation is a fundamental yet challenging task in computer vision. Difficult scenarios such as invisible keypoints, occlusions, and small-scale persons are still not well handled. In this paper, we present a novel pose estimation framework named UformPose that aims to alleviate these issues. UformPose has two core designs: a Shared Feature Pyramid Stem (SFPS) and a U-shaped hierarchical Multi-scale Keypoint-aware Attention Module (U-MKAM). SFPS is a feature pyramid stem with a sharing mechanism that learns stronger low-level features at the initial stage; the sharing mechanism also facilitates cross-resolution commonality learning. U-MKAM generates high-quality, high-resolution representations by integrating all levels of the backbone's feature representations layer by layer. More importantly, we exploit the flexibility of attention operations for keypoint-aware modeling, which explicitly captures and trades off the dependencies between keypoints. We empirically demonstrate the effectiveness of our framework through competitive pose estimation results on the COCO dataset. Extensive experiments and visual analysis on CrowdPose demonstrate the robustness of our model in crowded scenes.
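
As a rough, non-authoritative illustration of the two mechanisms the abstract names, the Python (PyTorch) sketch below fuses multi-scale backbone features layer by layer into one high-resolution map (the U-shaped decoding) and then lets learned per-keypoint queries attend over that map, reading the attention weights out as heatmaps. Everything here is an assumption for illustration only: the class name, channel widths, the 17-keypoint (COCO-style) query count, and the use of attention maps as heatmaps are our guesses, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UShapedKeypointDecoder(nn.Module):
    """Hypothetical sketch: U-shaped multi-scale fusion + keypoint-query attention."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), embed_dim=256,
                 num_keypoints=17, num_heads=8):
        super().__init__()
        # 1x1 convolutions project every pyramid level to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels)
        # 3x3 convolutions smooth each fused level on the way back up.
        self.smooth = nn.ModuleList(
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1)
            for _ in in_channels[:-1])
        # Learned keypoint queries; their attention over the fused
        # high-resolution map is read out as per-keypoint heatmaps.
        self.keypoint_queries = nn.Parameter(torch.randn(num_keypoints, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats: backbone outputs, highest resolution first.
        laterals = [proj(f) for proj, f in zip(self.lateral, feats)]
        # U-shaped, layer-by-layer fusion: upsample the coarser level and
        # merge it into the next finer one, coarsest to finest.
        x = laterals[-1]
        for i in range(len(laterals) - 2, -1, -1):
            x = F.interpolate(x, size=laterals[i].shape[-2:],
                              mode='bilinear', align_corners=False)
            x = self.smooth[i](laterals[i] + x)
        # Keypoint-aware attention: each query gathers evidence from every
        # spatial position of the high-resolution map.
        b, _, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        queries = self.keypoint_queries.unsqueeze(0).expand(b, -1, -1)
        _, attn_weights = self.attn(queries, tokens, tokens)   # (B, K, H*W)
        return attn_weights.reshape(b, -1, h, w)               # (B, K, H, W)

# Usage with ResNet-like feature shapes for a 256x192 input crop:
feats = [torch.randn(2, c, 64 // 2 ** i, 48 // 2 ** i)
         for i, c in enumerate((256, 512, 1024, 2048))]
heatmaps = UShapedKeypointDecoder()(feats)
print(heatmaps.shape)  # torch.Size([2, 17, 64, 48])

Because every query attends over the full map and shares the attention block with the other queries, inter-keypoint dependencies are modeled jointly rather than per-heatmap, which is the property the abstract attributes to its keypoint-aware attention.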

Cited By

  • Diffusion-Based Hypotheses Generation and Joint-Level Hypotheses Aggregation for 3D Human Pose Estimation, IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 11, pt. 1, pp. 10678–10691, Jun. 2024. 10.1109/TCSVT.2024.3415348
  • U-COPE: Taking a Further Step to Universal 9D Category-Level Object Pose Estimation, in Computer Vision – ECCV 2024, 2024, pp. 254–270. 10.1007/978-3-031-72684-2_15
  • Hierarchical Attention Network for Open-Set Fine-Grained Image Recognition, IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, pp. 3891–3904, Oct. 2023. 10.1109/TCSVT.2023.3325001
  • SmokePose: End-to-End Smoke Keypoint Detection, IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 10, pp. 5778–5789, Oct. 2023. 10.1109/TCSVT.2023.3258527

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 33, Issue 4
April 2023
514 pages

Publisher

IEEE Press
