
Multi-HMR: Multi-person Whole-Body Human Mesh Recovery in a Single Shot

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15081)


Abstract

We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on 448×448 images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.
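The two-stage design described in the abstract (heatmap-based person detection over ViT patch tokens, followed by one cross-attention query per detected person) can be sketched in NumPy. Everything below is illustrative, not the authors' implementation: the dimensions, the linear detection head, the single-head attention, and the 10-dimensional output standing in for the SMPL-X pose/shape/location parameters are all assumptions; the optional camera-ray encoding is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

num_tokens, dim = 196, 64                  # e.g. a 14x14 grid of ViT patch features
feats = rng.standard_normal((num_tokens, dim))

# 1) Detection: a per-token score forms a coarse 2D heatmap of person locations.
w_det = rng.standard_normal((dim, 1))
heatmap = 1.0 / (1.0 + np.exp(-(feats @ w_det))).ravel()
detected = np.flatnonzero(heatmap > 0.5)   # token indices treated as detected people

# 2) Prediction head (sketch): one query per detected person cross-attends
#    to the entire set of image tokens.
def cross_attention(query, keys, values):
    scores = query @ keys.T / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over all tokens
    return weights @ values

w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
w_out = rng.standard_normal((dim, 10))     # 10 is a stand-in for SMPL-X params

predictions = []
for idx in detected:
    q = feats[idx] @ w_q                   # query built from the detected token
    attended = cross_attention(q, feats @ w_k, feats @ w_v)
    predictions.append(attended @ w_out)   # per-person pose/shape/location vector
```

The key property this sketch preserves is that detection and per-person regression share the same single forward pass over one feature map, with no per-person crops.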

F. Baradel and T. Lucas contributed equally.



Notes

  1. https://download.europe.naverlabs.com/ComputerVision/MultiHMR/CUFFS
  2. https://github.com/facebookresearch/fvcore


Author information


Corresponding author

Correspondence to Fabien Baradel.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Baradel, F. et al. (2025). Multi-HMR: Multi-person Whole-Body Human Mesh Recovery in a Single Shot. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_12


  • DOI: https://doi.org/10.1007/978-3-031-73337-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73336-9

  • Online ISBN: 978-3-031-73337-6

  • eBook Packages: Computer Science, Computer Science (R0)
