
Pose focus transformer meet inter-part relation

Published: 08 August 2024

Abstract

Human pose estimation in crowded scenes is a challenging task: because of overlap and occlusion, it is difficult to infer pose cues from individual keypoints. We propose PFFormer, a new transformer-based approach that treats pose estimation as a hierarchical set prediction problem, first focusing on human windows and coarsely predicting whole-body poses globally within them. In PFFormer, we design a Windows Clustering Transformer (WCT) that reorganizes the image windows by selecting the attentive windows and fusing the inattentive ones, allowing the transformer to concentrate on important regions while reducing interference from the complex background; a global transformer then compensates for the information lost in this reorganization. Next, we partition the learned body pose into a set of structural parts and apply the Inter-Part Relation Module (IPRM) to capture the correlations among parts. The full-body poses and part features are then refined at a finer level by the Part-to-Joint Decoder (PJD). Extensive experiments show that PFFormer performs favorably against its counterparts on challenging benchmarks, including the COCO2017, CrowdPose, and OCHuman datasets. Its performance in crowded scenes, in particular, demonstrates the robustness of the proposed method to occlusion.
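
The window reorganization described above can be illustrated with a short sketch. The snippet below is a minimal PyTorch sketch, not the authors' implementation: it assumes each image window has already been embedded as a token and assigned an attentiveness score, keeps the top-scoring windows, and fuses the remaining ones into a single averaged token so that subsequent attention concentrates on informative regions. The function name, tensor shapes, keep ratio, and mean-fusion rule are illustrative assumptions.

    import torch

    def reorganize_windows(window_tokens, scores, keep_ratio=0.7):
        # window_tokens: (B, N, C) window embeddings; scores: (B, N) attentiveness scores.
        B, N, C = window_tokens.shape
        k = max(1, int(N * keep_ratio))
        # Keep the k most attentive windows per sample.
        topk = scores.topk(k, dim=1).indices                               # (B, k)
        kept = torch.gather(window_tokens, 1, topk.unsqueeze(-1).expand(B, k, C))
        # Fuse the remaining (inattentive) windows into one averaged token.
        mask = torch.ones(B, N, dtype=window_tokens.dtype, device=window_tokens.device)
        mask.scatter_(1, topk, 0.0)
        fused = (window_tokens * mask.unsqueeze(-1)).sum(1, keepdim=True) \
                / mask.sum(1, keepdim=True).clamp(min=1.0).unsqueeze(-1)   # (B, 1, C)
        # Subsequent attention operates on the kept windows plus the fused token.
        return torch.cat([kept, fused], dim=1)                             # (B, k+1, C)

A full model would feed this reduced token set into later transformer layers and, as the abstract notes, recover lost detail with a global transformer; those stages are omitted here.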



Published In

Expert Systems with Applications: An International Journal, Volume 240, Issue C
April 2024, 1601 pages

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 08 August 2024

Author Tags

  1. Human pose estimation
  2. Crowded scene
  3. Inter-part relation
  4. Transformer

Qualifiers

  • Research-article
