More Web Proxy on the site http://driver.im/

Article

HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

Authors:

Eugene Valassakis,

Guillermo Garcia-HernandoAuthors Info & Claims

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVIII

Pages 479 - 496

https://doi.org/10.1007/978-3-031-72920-1_27

Published: 01 October 2024 Publication History

Abstract

Predicting camera-space hand meshes from single RGB images is crucial for enabling realistic hand interactions in 3D virtual and augmented worlds. Previous work typically divided the task into two stages: given a cropped image of the hand, predict meshes in relative coordinates, followed by lifting these predictions into camera space in a separate and independent stage, often resulting in the loss of valuable contextual and scale information. To prevent the loss of these cues, we propose unifying these two stages into an end-to-end solution that addresses the 2D-3D correspondence problem. This solution enables back-propagation from camera space outputs to the rest of the network through a new differentiable global positioning module. We also introduce an image rectification step that harmonizes both the training dataset and the input image as if they were acquired with the same camera, helping to alleviate the inherent scale-depth ambiguity of the problem. We validate the effectiveness of our framework in evaluations against several baselines and state-of-the-art approaches across three public benchmarks.

References

[1]

Antotsiou, D., Garcia-Hernando, G., Kim, T.K.: Task-oriented hand motion retargeting for dexterous manipulation imitation. In: ECCV Workshop (2018)

[2]

Apple: Vision Pro. https://www.apple.com/apple-vision-pro/. Accessed 7 Mar 2024

[3]

Armagan, A., et al.: Measuring generalisation to unseen viewpoints, articulations, shapes and objects for 3D hand pose estimation under hand-object interaction. In: ECCV (2020)

[4]

Baek, S., Kim, K.I., Kim, T.K.: Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In: CVPR (2019)

[5]

Baek, S., Kim, K.I., Kim, T.K.: Weakly-supervised domain adaptation via GAN and mesh model for estimating 3D hand poses interacting objects. In: CVPR (2020)

[6]

Bhatnagar, B.L., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Loopreg: self-supervised learning of implicit surface correspondences, pose and shape for 3D human mesh registration. In: NeurIPS (2020)

[7]

Bhowmik, A., Gumhold, S., Rother, C., Brachmann, E.: Reinforced feature points: optimizing feature detection and description for a high-level task. In: CVPR (2020)

[8]

Boukhayma, A., Bem, R.D., Torr, P.H.: 3D hand shape and pose from images in the wild. In: CVPR (2019)

[9]

Brachmann, E., et al.: DSAC-differentiable RANSAC for camera localization. In: CVPR (2017)

[10]

Cao, Z., Radosavovic, I., Kanazawa, A., Malik, J.: Reconstructing hand-object interactions in the wild. In: CVPR (2021)

[11]

Chen, B., Parra, A., Cao, J., Li, N., Chin, T.J.: End-to-end learnable geometric vision by backpropagating PNP optimization. In: CVPR (2020)

[12]

Chen, H., Wang, P., Wang, F., Tian, W., Xiong, L., Li, H.: EPro-PnP: generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In: CVPR (2022)

[13]

Chen, P., et al.: I2UV-HandNet: image-to-UV prediction network for accurate and high-fidelity 3D hand mesh modeling. In: ICCV (2021)

[14]

Chen, X., et al.: Mobrecon: mobile-friendly hand mesh reconstruction from monocular image. In: CVPR (2022)

[15]

Chen, X., et al.: Camera-space hand mesh recovery via semantic aggregation and adaptive 2D-1D registration. In: CVPR (2021)

[16]

Chen, X., Wang, B., Shum, H.Y.: Hand avatar: free-pose hand animation and rendering from monocular video. In: CVPR (2023)

[17]

Chen, Y., et al.: Model-based 3D hand reconstruction via self-supervised learning. In: CVPR (2021)

[18]

Garcia-Hernando, G., Johns, E., Kim, T.K.: Physics-based dexterous manipulations with estimated hand poses and residual reinforcement learning. In: IROS (2020)

[19]

Ge, L., et al.: 3D hand shape and pose estimation from a single RGB image. In: CVPR (2019)

[20]

Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: a method for 3D annotation of hand and object poses. In: CVPR (2020)

[21]

Hampali, S., Sarkar, S.D., Rad, M., Lepetit, V.: Keypoint transformer: solving joint identification in challenging hands and object interactions for accurate 3D pose estimation. In: CVPR (2022)

[22]

Han, S., et al.: Megatrack: monochrome egocentric articulated hand-tracking for virtual reality. ACM TOG (2020)

[23]

Hartley R and Zisserman A Multiple View Geometry in Computer Vision 2003 Cambridge Cambridge University Press

Digital Library

[24]

Hasson, Y., Tekin, B., Bogo, F., Laptev, I., Pollefeys, M., Schmid, C.: Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In: CVPR (2020)

[25]

Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)

[26]

Huang, L., et al.: Neural voting field for camera-space 3D hand pose estimation. In: CVPR (2023)

[27]

Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI 36(7), 1325–1339 (2013)

[28]

Iqbal, U., Molchanov, P., Breuel Juergen Gall, T., Kautz, J.: Hand pose estimation via latent 2.5D heatmap regression. In: ECCV (2018)

[29]

Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)

[30]

Karunratanakul, K., Spurr, A., Fan, Z., Hilliges, O., Tang, S.: A skeleton-driven neural occupancy representation for articulated hands. In: 3DV (2021)

[31]

Karunratanakul, K., Yang, J., Zhang, Y., Black, M.J., Muandet, K., Tang, S.: Grasping field: learning implicit representations for human grasps. In: 3DV (2020)

[32]

Kulon, D., Guler, R.A., Kokkinos, I., Bronstein, M.M., Zafeiriou, S.: Weakly-supervised mesh-convolutional hand reconstruction in the wild. In: CVPR (2020)

[33]

Kuznetsova A et al. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale IJCV 2020 128 7 1956-1981

[34]

Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: carrying location information in full frames into human pose and shape estimation. In: ECCV (2022)

[35]

Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: CVPR (2021)

[36]

Lin, K., Wang, L., Liu, Z.: Mesh graphormer. In: ICCV (2021)

[37]

Meta: Quest 3. https://www.meta.com/us/quest/quest-3/. Accessed 7 Mar 2024

[38]

Mihajlovic, M., Zhang, Y., Black, M.J., Tang, S.: Leap: learning articulated occupancy of people. In: CVPR (2021)

[39]

Moon, G., Chang, J.Y., Lee, K.M.: Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In: ICCV (2019)

[40]

Moon, G., Lee, K.M.: I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. In: ECCV (2020)

[41]

Park, J., Oh, Y., Moon, G., Choi, H., Lee, K.M.: Handoccnet: occlusion-robust 3D hand mesh estimation network. In: CVPR (2022)

[42]

Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR (2019)

[43]

Peng, S., et al.: Neural body: implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: CVPR (2021)

[44]

Prince SJ Computer Vision: Models, Learning, and Inference 2012 Cambridge Cambridge University Press

[45]

Remelli, E., Han, S., Honari, S., Fua, P., Wang, R.: Lightweight multi-view 3D pose estimation through camera-disentangled representation. In: CVPR (2020)

[46]

Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM TOG (2017)

[47]

Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In: CVPR (2019)

[48]

Spurr, A., Iqbal, U., Molchanov, P., Hilliges, O., Kautz, J.: Weakly supervised 3D hand pose estimation via biomechanical constraints. In: ECCV (2020)

[49]

Tang, X., Wang, T., Fu, C.W.: Towards accurate alignment in real-time 3D hand-mesh reconstruction. In: ICCV (2021)

[50]

Wei, T., Patel, Y., Shekhovtsov, A., Matas, J., Barath, D.: Generalized differentiable RANSAC. In: ICCV (2023)

[51]

Yin, W., et al.: Metric3D: towards zero-shot metric 3D prediction from a single image. In: ICCV (2023)

[52]

Yuan, S., et al.: Depth-based 3D hand pose estimation: from current achievements to future goals. In: CVPR (2018)

[53]

Zhang, X., et al.: Hand image understanding via deep multi-task learning. In: ICCV (2021)

[54]

Zhang, X., Li, Q., Mo, H., Zhang, W., Zheng, W.: End-to-end hand mesh recovery from a monocular RGB image. In: ICCV (2019)

[55]

Zhou, Y., Habermann, M., Xu, W., Habibie, I., Theobalt, C., Xu, F.: Monocular real-time hand shape and motion capture using multi-modal data. In: CVPR (2020)

[56]

Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In: ICCV (2019)

Index Terms

HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

Index terms have been assigned to the content through auto-classification.

Recommendations

Hand-Eye Camera Calibration with an Optical Tracking System
ICDSC '18: Proceedings of the 12th International Conference on Distributed Smart Cameras

This paper presents a method for hand-eye camera calibration via an optical tracking system (OTS) faciltating robotic applications. The camera pose cannot be directly tracked via the OTS. Because of this, a transformation matrix between a marker-plate ...
Global hand pose estimation by multiple camera ellipse tracking

Immersive virtual environments with life-like interaction capabilities have very demanding requirements including high-precision motion capture and high-processing speed. These issues raise many challenges for computer vision-based motion estimation ...
Visual Modeling with a Hand-Held Camera

In this paper a complete system to build visual models from camera images is presented. The system can deal with uncalibrated image sequences acquired with a hand-held camera. Based on tracked or matched features the relations between multiple views are ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVIII

Sep 2024

583 pages

ISBN:978-3-031-72919-5

DOI:10.1007/978-3-031-72920-1

Editors:
Aleš Leonardis
University of Birmingham, Birmingham, UK
,
Elisa Ricci
https://ror.org/05trd4x28University of Trento, Trento, Italy
,
Stefan Roth
Technical University of Darmstadt, Darmstadt, Hessen, Germany
,
Olga Russakovsky
Princeton University, Palo Alto, CA, USA
,
Torsten Sattler
Czech Technical University in Prague, Prague, Czech Republic
,
Gül Varol
École des Ponts ParisTech, Marne-la-Vallée, France

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 October 2024

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 06 Jan 2025

Other Metrics

View Author Metrics

Citations

View Options

View options

Media

Figures

Other

Tables

View Table of Contents