Abstract
Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous work typically divided the task into two stages: first, given a cropped image of the hand, predict meshes in relative coordinates; then, lift these predictions into camera space in a separate and independent stage. This separation often loses valuable contextual and scale information. To prevent the loss of these cues, we propose unifying the two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera-space outputs to the rest of the network through a new differentiable global positioning module. We also introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks.
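To make the 2D-3D correspondence idea concrete, the following is a minimal sketch of one standard way to recover a camera-space translation from root-relative 3D points and their 2D pixel locations under a pinhole camera. It is an illustration under stated assumptions, not the authors' module: the function names, the per-point confidence weights, and the weighted least-squares formulation are all hypothetical choices made for this example. For a relative point (X, Y, Z), translation (t_x, t_y, t_z), and normalized image coordinates (x, y), the projection x = (X + t_x) / (Z + t_z) rearranges to the linear constraint t_x - x t_z = x Z - X, and similarly for y, so the translation has a closed-form solution.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a: Sequence[Sequence[float]], b: Sequence[float]) -> Vec3:
    """Solve the 3x3 system a @ t = b with Cramer's rule."""
    d = _det3(a)
    if abs(d) < 1e-12:
        raise ValueError("degenerate correspondences: system is singular")
    cols = []
    for col in range(3):
        m = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        cols.append(_det3(m) / d)
    return (cols[0], cols[1], cols[2])


def project(point: Vec3, t: Vec3, fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Pinhole projection of a relative point after adding translation t."""
    X, Y, Z = point[0] + t[0], point[1] + t[1], point[2] + t[2]
    return (fx * X / Z + cx, fy * Y / Z + cy)


def global_translation(points_rel: List[Vec3], pixels: List[Vec2],
                       weights: List[float],
                       fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Weighted least-squares translation from 2D-3D correspondences.

    Hypothetical helper: accumulates the normal equations A^T W A t = A^T W b,
    where each point contributes the rows (1, 0, -x) and (0, 1, -y).
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for (X, Y, Z), (u, v), w in zip(points_rel, pixels, weights):
        x = (u - cx) / fx  # normalized image coordinates
        y = (v - cy) / fy
        rows = (((1.0, 0.0, -x), x * Z - X),
                ((0.0, 1.0, -y), y * Z - Y))
        for row, rhs in rows:
            for i in range(3):
                atb[i] += w * row[i] * rhs
                for j in range(3):
                    ata[i][j] += w * row[i] * row[j]
    return _solve3(ata, atb)
```

Because the solve uses only sums, products, and divisions, the same computation written in an automatic-differentiation framework would let gradients of a camera-space loss flow back to the predicted 3D points, 2D locations, and weights. The intrinsics (fx, fy, cx, cy) are taken as known inputs here.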
E. Valassakis—Now at Synthesia. Work done while at Niantic.
Acknowledgements
We would like to thank Filippo Aleotti for his help with baseline experiments and infrastructure; Jamie Watson, Zawar Qureshi, and Jakub Powierza for their help with infrastructure; Axel Laguna for his insightful discussions on minimal solvers and network architectures; Daniyar Turmukhambetov for valuable technical discussions; and Gabriel Brostow, Sara Vicente, Jessica Van Brummelen, and Michael Firman for their valuable feedback on different versions of the manuscript.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Valassakis, E., Garcia-Hernando, G. (2025). HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_27
DOI: https://doi.org/10.1007/978-3-031-72920-1_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72919-5
Online ISBN: 978-3-031-72920-1
eBook Packages: Computer Science, Computer Science (R0)