HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

Eugene Valassakis¹³ &
Guillermo Garcia-Hernando¹³

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15096)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous work typically divided the task into two stages: given a cropped image of the hand, predict meshes in relative coordinates, followed by lifting these predictions into camera space in a separate and independent stage, often resulting in the loss of valuable contextual and scale information. To prevent the loss of these cues, we propose unifying these two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module. We also introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks.
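
To make the 2D-3D correspondence idea concrete, the sketch below recovers a single camera-space translation from root-relative 3D points and their 2D pixel locations by linear least squares. It is an illustration under stated assumptions and not the chapter's implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), and the function names, the optional per-point weights, and the synthetic data are all hypothetical. Because the solution is a closed-form composition of arithmetic operations, an automatic-differentiation framework could in principle propagate gradients through it, which is the property a differentiable positioning step relies on.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve the 3x3 system m @ t = rhs with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate correspondences")
    solution = []
    for k in range(3):
        # Replace column k of m with rhs.
        mk = [[rhs[i] if j == k else m[i][j] for j in range(3)] for i in range(3)]
        solution.append(det3(mk) / d)
    return solution


def solve_translation(rel_points, pixels, fx, fy, cx, cy, weights=None):
    """Least-squares translation t such that rel_points + t projects onto pixels.

    Hypothetical helper: assumes a pinhole camera and one weight per point.
    """
    if weights is None:
        weights = [1.0] * len(rel_points)
    ata = [[0.0] * 3 for _ in range(3)]  # accumulates A^T W A
    atb = [0.0] * 3                      # accumulates A^T W b
    for (X, Y, Z), (u, v), w in zip(rel_points, pixels, weights):
        x = (u - cx) / fx  # normalised image coordinates
        y = (v - cy) / fy
        # x * (Z + tz) = X + tx  rearranges to  tx - x * tz = x * Z - X
        # y * (Z + tz) = Y + ty  rearranges to  ty - y * tz = y * Z - Y
        rows = (((1.0, 0.0, -x), x * Z - X),
                ((0.0, 1.0, -y), y * Z - Y))
        for a, b in rows:
            for i in range(3):
                atb[i] += w * a[i] * b
                for j in range(3):
                    ata[i][j] += w * a[i] * a[j]
    return solve3(ata, atb)


# Synthetic check: project points with a known translation, then recover it.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
true_t = (0.05, -0.02, 0.6)
rel_points = [(0.0, 0.0, 0.0), (0.03, 0.01, -0.02),
              (-0.04, 0.05, 0.01), (0.02, -0.06, 0.03)]
pixels = [(fx * (X + true_t[0]) / (Z + true_t[2]) + cx,
           fy * (Y + true_t[1]) / (Z + true_t[2]) + cy)
          for X, Y, Z in rel_points]
recovered = solve_translation(rel_points, pixels, fx, fy, cx, cy)
```

With noise-free correspondences the recovered translation matches `true_t` up to floating-point error. With noisy predictions, the weights would let less reliable correspondences contribute less to the solution.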

E. Valassakis—Now at Synthesia. Work done while at Niantic.

Acknowledgements

We would like to thank Filippo Aleotti for his help with baseline experiments and infrastructure; Jamie Watson, Zawar Qureshi, and Jakub Powierza for their help with infrastructure; Axel Laguna for his insightful discussions on minimal solvers and network architectures; Daniyar Turmukhambetov for valuable technical discussions; and Gabriel Brostow, Sara Vicente, Jessica Van Brummelen, and Michael Firman for their valuable feedback on different versions of the manuscript.

Author information

Authors and Affiliations

Niantic, San Francisco, USA
Eugene Valassakis & Guillermo Garcia-Hernando

Authors

Eugene Valassakis
Guillermo Garcia-Hernando

Corresponding author

Correspondence to Guillermo Garcia-Hernando.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Palo Alto, CA, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

About this paper

Cite this paper

Valassakis, E., Garcia-Hernando, G. (2025). HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_27

DOI: https://doi.org/10.1007/978-3-031-72920-1_27
Published: 01 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72919-5
Online ISBN: 978-3-031-72920-1
eBook Packages: Computer Science, Computer Science (R0)
