Abstract
Besides a 3D mesh, Human Mesh Recovery (HMR) methods usually need to estimate a camera for computing the 2D reprojection loss. Previous approaches may suffer from the following problem: the mesh and the camera are both incorrect, yet their combination still yields a low reprojection loss. To alleviate this problem, we define multiple RoIs (regions of interest) containing the same human and propose a multi-RoI-based HMR method. Our key idea is that, with multiple RoIs as input, we can estimate multiple local cameras and thus have the opportunity to design additional constraints between these cameras, improving their accuracy and, in turn, the accuracy of the corresponding 3D mesh. To implement this idea, we propose an RoI-aware feature fusion network that estimates a 3D mesh shared by all RoIs as well as a local camera for each RoI. We observe that each local camera can be converted to the camera of the full image, which allows us to construct a camera consistency loss as an additional constraint on the local cameras. Another benefit of introducing multiple RoIs is that we can encapsulate our network in a contrastive learning framework and apply a contrastive loss to regularize training. Experiments demonstrate the effectiveness of our multi-RoI HMR method and its superiority over recent state-of-the-art methods. Our code is available at https://github.com/CptDiaos/Multi-RoI.
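To make the camera consistency idea more concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes weak-perspective crop cameras and a CLIFF-style conversion from each RoI's local camera to a full-image translation; the function names (`crop_cam_to_full_cam`, `camera_consistency_loss`) and the choice of penalizing deviation from the mean are illustrative assumptions.

```python
import torch

def crop_cam_to_full_cam(crop_cam, center, crop_size, full_img_size, focal_length):
    """Convert weak-perspective crop cameras (s, tx, ty) to full-image translations.

    crop_cam:      (N, 3) predicted (s, tx, ty), one row per RoI of the same person
    center:        (N, 2) RoI centers (cx, cy) in full-image pixel coordinates
    crop_size:     (N,)   RoI side lengths b in pixels
    full_img_size: (2,)   full image (width, height) in pixels
    focal_length:  scalar assumed full-image focal length
    Uses a CLIFF-style mapping t = [tx + 2*dx/(s*b), ty + 2*dy/(s*b), 2*f/(s*b)].
    """
    s, tx, ty = crop_cam[:, 0], crop_cam[:, 1], crop_cam[:, 2]
    img_w, img_h = full_img_size
    dx = center[:, 0] - img_w / 2.0  # RoI offset from the full-image principal point
    dy = center[:, 1] - img_h / 2.0
    sb = s * crop_size + 1e-9        # avoid division by zero
    t_x = tx + 2.0 * dx / sb
    t_y = ty + 2.0 * dy / sb
    t_z = 2.0 * focal_length / sb
    return torch.stack([t_x, t_y, t_z], dim=-1)  # (N, 3)

def camera_consistency_loss(full_cams):
    """All RoIs depict the same person in the same image, so the recovered
    full-image cameras should agree; penalize deviation from their mean."""
    mean_cam = full_cams.mean(dim=0, keepdim=True)
    return ((full_cams - mean_cam) ** 2).mean()
```

In training, such a consistency term would be added alongside the usual 2D reprojection and 3D supervision losses; the exact conversion, loss weighting, and the RoI-aware feature fusion are described in the paper itself.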
Notes
1. The authors Yongwei Nie and Changzhen Liu signed the license and produced all the experimental results in this paper. Meta did not have access to the datasets.
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under grant 2022YFE0112200, in part by the Natural Science Foundation of China under grant U21A20520, grant 62325204, and grant 62072191, in part by the Key-Area Research and Development Program of Guangzhou City under grant 202206030009, and in part by the Guangdong Basic and Applied Basic Research Fund under grant 2023A1515030002 and grant 2024A1515011995.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nie, Y., Liu, C., Long, C., Zhang, Q., Li, G., Cai, H. (2025). Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6
eBook Packages: Computer Science, Computer Science (R0)