Abstract
We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, in the form of SMPL-X parameters together with each person's 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations from features produced by a standard Vision Transformer (ViT) backbone. It then predicts whole-body pose, shape and 3D location using a new cross-attention module, the Human Prediction Head (HPH), in which a single query per detected person attends to the entire set of image features. Because directly predicting fine-grained hand and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-body Subjects dataset, containing humans close to the camera with diverse hand poses. Incorporating it into the training data further improves predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding the camera ray direction of each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on 448×448 images already yields a fast and competitive model, while larger backbones and higher resolutions obtain state-of-the-art results.
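The abstract describes the HPH as one query per detected person attending to the entire set of image tokens. The following is a minimal numpy sketch of that single-query cross-attention pattern; the function name, shapes, and the absence of learned projections are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def hph_cross_attention(person_query, tokens):
    """One person query of shape (d,) attends to all ViT tokens of shape
    (N, d); returns an attention-pooled feature from which pose, shape and
    3D location would be regressed. Illustrative sketch only."""
    d_k = person_query.shape[0]
    scores = tokens @ person_query / np.sqrt(d_k)  # (N,) dot-product scores
    w = np.exp(scores - scores.max())              # numerically stable softmax
    w /= w.sum()                                   # attention weights over tokens
    return w @ tokens                              # (d,) pooled feature

# Usage: 1024 tokens (a 32x32 patch grid for a 448x448 image), dim 64
rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 64))
query = rng.standard_normal(64)
out = hph_cross_attention(query, tokens)
```

In the real model each query would also pass through learned query/key/value projections, but the attention pattern, one query against all tokens, is the same.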
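The abstract also mentions optionally encoding a camera ray direction for each image token when intrinsics are available. A hedged sketch of how per-token ray directions could be computed from a pinhole intrinsics matrix K, assuming a ViT patch grid (the patch size and grid layout are assumptions for illustration):

```python
import numpy as np

def patch_ray_directions(K, H, W, patch=14):
    """Back-project the center of each patch through the pinhole camera K
    to get a unit ray direction per image token. Returns (N, 3)."""
    ys = (np.arange(H // patch) + 0.5) * patch      # patch-center rows (pixels)
    xs = (np.arange(W // patch) + 0.5) * patch      # patch-center cols (pixels)
    u, v = np.meshgrid(xs, ys)
    pix = np.stack([u.ravel(), v.ravel(), np.ones(u.size)])  # (3, N) homogeneous
    rays = np.linalg.inv(K) @ pix                   # back-project to camera rays
    rays /= np.linalg.norm(rays, axis=0)            # normalize to unit length
    return rays.T                                   # (N, 3), one ray per token

# Usage: 448x448 image, hypothetical intrinsics
K = np.array([[500.0, 0.0, 224.0],
              [0.0, 500.0, 224.0],
              [0.0, 0.0, 1.0]])
rays = patch_ray_directions(K, 448, 448)
```

Such per-token directions could then be embedded and added to the token features, letting the network condition its metric 3D location predictions on the camera geometry.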
F. Baradel and T. Lucas contributed equally.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Baradel, F. et al. (2025). Multi-HMR: Multi-person Whole-Body Human Mesh Recovery in a Single Shot. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73336-9
Online ISBN: 978-3-031-73337-6