Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12370)

Abstract

3D hand pose estimation is still far from a well-solved problem, mainly due to the highly nonlinear dynamics of hand pose and the difficulty of modeling its inherent structural dependencies. To address this issue, we connect this structured output learning problem with the structured modeling framework of the sequence transduction field. Standard transduction models such as the Transformer adopt an autoregressive connection to capture dependencies from previously generated tokens, and further correlate this information with the input sequence in order to prioritize the set of relevant input tokens for the current token generation. To borrow wisdom from this structured learning framework while avoiding sequential modeling for hand pose, we take a 3D point set as input and propose to leverage the Transformer architecture with a novel non-autoregressive structured decoding mechanism. Specifically, instead of using previously generated results, our decoder utilizes a reference hand pose to provide equivalent dependencies among hand joints for each output joint generation. By imposing the reference structural dependencies, we can correlate the information with the input 3D points through a multi-head attention mechanism, aiming to discover informative points from different perspectives for the localization of each hand joint. We demonstrate our model’s effectiveness on multiple challenging hand pose datasets, comparing with several state-of-the-art methods.
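
The decoding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes identity projections in place of learned weights, random toy features, and a single attention step. The function names `attend`, `split_heads`, and `decode_joints` are hypothetical. It shows only the core pattern, in which queries derived from a fixed reference pose attend over input point features with several heads, so that all joints are produced in parallel.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: one output row per query row."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

def split_heads(rows, num_heads):
    """Slice each feature vector into num_heads equal chunks."""
    size = len(rows[0]) // num_heads
    return [[r[h * size:(h + 1) * size] for r in rows] for h in range(num_heads)]

def decode_joints(reference_joints, point_features, num_heads=2):
    """Non-autoregressive decoding: every query comes from the fixed
    reference pose, so no output depends on a previously generated joint."""
    q_heads = split_heads(reference_joints, num_heads)
    kv_heads = split_heads(point_features, num_heads)
    # Assumption: keys and values are the raw point features (no learned projections).
    per_head = [attend(q, kv, kv)[0] for q, kv in zip(q_heads, kv_heads)]
    # Concatenate the head outputs back into one feature vector per joint.
    return [sum((head[j] for head in per_head), [])
            for j in range(len(reference_joints))]

random.seed(0)
points = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 toy point features
ref = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]     # 3 toy reference joints
out = decode_joints(ref, points)  # all 3 joints decoded in one pass
```

Because each query is taken from the reference pose rather than from earlier outputs, decoding a single joint on its own gives the same result as decoding it together with the others. That independence is what allows the decoding to run in parallel.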

Author information

Corresponding author

Correspondence to Lin Huang.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 30084 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Huang, L., Tan, J., Liu, J., Yuan, J. (2020). Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12370. Springer, Cham. https://doi.org/10.1007/978-3-030-58595-2_2

  • DOI: https://doi.org/10.1007/978-3-030-58595-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58594-5

  • Online ISBN: 978-3-030-58595-2

  • eBook Packages: Computer Science, Computer Science (R0)
