Abstract
We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, in the form of SMPL-X parameters together with each person's 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations from features produced by a standard Vision Transformer (ViT) backbone. It then predicts whole-body pose, shape and 3D location using a new cross-attention module, the Human Prediction Head (HPH), in which a single query per detected person attends to the entire set of image features. Because directly predicting fine-grained hand and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-body Subjects dataset, containing humans close to the camera with diverse hand poses. Incorporating it into the training data further improves predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding the camera ray direction of each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on 448×448 images already yields a fast and competitive model, while larger backbones and higher resolutions obtain state-of-the-art results.
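The abstract describes the HPH as one query per detected person attending to the entire set of image tokens. The following is a minimal numpy sketch of that single-query cross-attention pattern; the function name, shapes, and the absence of learned projections are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def hph_cross_attention(person_query, tokens):
    """One person query of shape (d,) attends to all ViT tokens of shape
    (N, d); returns an attention-pooled feature from which pose, shape and
    3D location would be regressed. Illustrative sketch only."""
    d_k = person_query.shape[0]
    scores = tokens @ person_query / np.sqrt(d_k)  # (N,) dot-product scores
    w = np.exp(scores - scores.max())              # numerically stable softmax
    w /= w.sum()                                   # attention weights over tokens
    return w @ tokens                              # (d,) pooled feature

# Usage: 1024 tokens (a 32x32 patch grid for a 448x448 image), dim 64
rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 64))
query = rng.standard_normal(64)
out = hph_cross_attention(query, tokens)
```

In the real model each query would also pass through learned query/key/value projections, but the attention pattern, one query against all tokens, is the same.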
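The abstract also mentions optionally encoding a camera ray direction for each image token when intrinsics are available. A hedged sketch of how per-token ray directions could be computed from a pinhole intrinsics matrix K, assuming a ViT patch grid (the patch size and grid layout are assumptions for illustration):

```python
import numpy as np

def patch_ray_directions(K, H, W, patch=14):
    """Back-project the center of each patch through the pinhole camera K
    to get a unit ray direction per image token. Returns (N, 3)."""
    ys = (np.arange(H // patch) + 0.5) * patch      # patch-center rows (pixels)
    xs = (np.arange(W // patch) + 0.5) * patch      # patch-center cols (pixels)
    u, v = np.meshgrid(xs, ys)
    pix = np.stack([u.ravel(), v.ravel(), np.ones(u.size)])  # (3, N) homogeneous
    rays = np.linalg.inv(K) @ pix                   # back-project to camera rays
    rays /= np.linalg.norm(rays, axis=0)            # normalize to unit length
    return rays.T                                   # (N, 3), one ray per token

# Usage: 448x448 image, hypothetical intrinsics
K = np.array([[500.0, 0.0, 224.0],
              [0.0, 500.0, 224.0],
              [0.0, 0.0, 1.0]])
rays = patch_ray_directions(K, 448, 448)
```

Such per-token directions could then be embedded and added to the token features, letting the network condition its metric 3D location predictions on the camera geometry.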
F. Baradel and T. Lucas contributed equally.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Baradel, F. et al. (2025). Multi-HMR: Multi-person Whole-Body Human Mesh Recovery in a Single Shot. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73336-9
Online ISBN: 978-3-031-73337-6