EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset

Amy Zhao¹³,
Chengcheng Tang¹³,
Lezi Wang¹³,
Yijing Li¹³,
Mihika Dave¹³,
Lingling Tao¹³,
Christopher D. Twigg¹³ &
…
Robert Y. Wang¹³

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15137))

Included in the following conference series:

European Conference on Computer Vision

97 Accesses

Abstract

Accurate tracking of a user’s body pose while wearing a virtual reality (VR), augmented reality (AR) or mixed reality (MR) headset is a prerequisite for authentic self-expression, natural social presence, and intuitive user interfaces. Existing body tracking approaches on VR/AR devices are either under-constrained, e.g., attempting to infer full body pose from only headset and controller pose, or require impractical hardware setups that place cameras far from a user’s face to improve body visibility. In this paper, we present the first controller-less egocentric body tracking solution that runs on an actual VR device using the same cameras that are used for SLAM tracking. We propose a novel egocentric tracking architecture that models the temporal history of body motion using multi-view latent features. Furthermore, we release the first large-scale real-image dataset for egocentric body tracking, EgoBody3M, with a realistic VR headset configuration and diverse subjects and motions. Benchmarks on the dataset shows that our approach outperforms other state-of-the-art methods in both accuracy and smoothness of the resulting motion. We perform ablation studies on our model choices and demonstrate the method running in realtime on a VR headset. Our dataset with more than 30 h of recordings and 3 million frames will be made publicly available.

A. Zhao and C. Tang—Equal contribution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: GBP 19.95; Price includes VAT (United Kingdom)

eBook: GBP 49.99; Price includes VAT (United Kingdom)

Softcover Book: GBP 64.99; Price includes VAT (United Kingdom)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Real-time body tracking in virtual reality using a Vive tracker

Article 23 November 2018

References

Apple Vision Pro. https://www.apple.com/apple-vision-pro/. Accessed 17 Nov 2023
Meta Quest 3. https://www.meta.com/ie/quest/quest-3/. Accessed 17 Nov 2023
Meta Quest Pro. https://www.meta.com/quest/quest-pro/. Accessed 17 Nov 2023
Microsoft Azure Kinect. https://azure.microsoft.com/en-us/products/kinect-dk. Accessed 17 Nov 2023
Pico 4. https://www.picoxr.com/global/products/pico4. Accessed 17 Nov 2023
Akada, H., Wang, J., Shimada, S., Takahashi, M., Theobalt, C., Golyanik, V.: UnrealEgo: a new dataset for robust egocentric 3D human motion capture. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13666, pp. 1–17. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20068-7_1
Chapter Google Scholar
Allen, B., Curless, B., Popović, Z.: The space of human body shapes: reconstruction and parameterization from range scans. ACM Trans. Graph. 22(3), 587–594 (2003)
Article Google Scholar
Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F., Grundmann, M.: BlazePose: on-device real-time body pose tracking. In: CVPR Workshop on Computer Vision for Augmented and Virtual Reality (2020)
Google Scholar
Cha, Y.W., et al.: Towards fully mobile 3D face, body, and environment capture using only head-worn cameras. IEEE Trans. Vis. Comput. Graph. 24(11), 2993–3004 (2018). https://doi.org/10.1109/TVCG.2018.2868527
Article Google Scholar
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112 (2018)
Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Google Scholar
Dittadi, A., Dziadzio, S., Cosker, D., Lundell, B., Cashman, T., Shotton, J.: Full-body motion from a single head-mounted device: generating SMPL poses from partial observations. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
Google Scholar
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981). https://doi.org/10.1145/358669.358692
Article MathSciNet Google Scholar
Hähnel, D., Thrun, S., Burgard, W.: An extension of the ICP algorithm for modeling nonrigid objects with mobile robots. In: Proceedings of IJCAI (2003)
Google Scholar
Han, S., et al.: UmeTrack: unified multi-view end-to-end hand tracking for VR. In: SIGGRAPH Asia 2022 Conference Papers, pp. 1–9 (2022)
Google Scholar
Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR (1994)
Google Scholar
Hu, W., Zhang, C., Zhan, F., Zhang, L., Wong, T.T.: Conditional directed graph convolution for 3D human pose estimation. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 602–611 (2021)
Google Scholar
Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y.: Learnable triangulation of human pose. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7718–7727 (2019)
Google Scholar
Ito, K., Tada, M., Ujike, H., Hyodo, K.: Effects of the weight and balance of head-mounted displays on physical load. Appl. Sci. 11(15), 6802 (2021)
Article Google Scholar
Jeon, H.G., Lee, J.Y., Im, S., Ha, H., Kweon, I.S.: Stereo matching with color and monochrome cameras in low-light conditions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4086–4094 (2016)
Google Scholar
Jiang, H., Ithapu, V.K.: Egocentric pose estimation from human vision span. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10986–10994. IEEE (2021)
Google Scholar
Khirodkar, R., Bansal, A., Ma, L., Newcombe, R., Vo, M., Kitani, K.: EgoHumans: an egocentric 3D multi-human benchmark. arXiv preprint arXiv:2305.16487 (2023)
Kocabas, M., Athanasiou, N., Black, M.J.: VIBE: video inference for human body pose and shape estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5253–5263 (2020)
Google Scholar
Li, J., Liu, C., Wu, J.: Ego-body pose estimation via ego-head pose estimation. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society (2023)
Google Scholar
Li, S., Chan, A.B.: 3D human pose estimation from monocular images with deep convolutional neural network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014, Part II. LNCS, vol. 9004, pp. 332–347. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16808-1_23
Chapter Google Scholar
Pantone LLC: Pantone SkinTone Guide (2012)
Google Scholar
Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649 (2017)
Google Scholar
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part VIII. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Chapter Google Scholar
Parger, M., Mueller, J.H., Schmalstieg, D., Steinberger, M.: Human upper-body inverse kinematics for increased embodiment in consumer-grade virtual reality. In: Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology, VRST 2018 (2018)
Google Scholar
Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7025–7034 (2017)
Google Scholar
Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7753–7762 (2019)
Google Scholar
Remelli, E., Han, S., Honari, S., Fua, P., Wang, R.: Lightweight multi-view 3D pose estimation through camera-disentangled representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6040–6049 (2020)
Google Scholar
Rhodin, H., et al.: EgoCap: egocentric marker-less motion capture with two fisheye cameras. ACM Trans. Graph. 35(6), 1–11 (2016)
Article Google Scholar
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Chapter Google Scholar
Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152 (2001)
Google Scholar
Sak, H., Senior, A., Beaufays, F.: Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128 (2014)
Smith, L., Topin, N.: Super-convergence: very fast training of neural networks using large learning rates, p. 36 (2019). https://doi.org/10.1117/12.2520589
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)
Google Scholar
Tekin, B., Katircioglu, I., Salzmann, M., Lepetit, V., Fua, P.: Structured prediction of 3D human pose with deep neural networks. arXiv preprint arXiv:1605.05180 (2016)
Tekin, B., Rozantsev, A., Lepetit, V., Fua, P.V.: Direct prediction of 3D body poses from motion compensated sequences. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Google Scholar
Tome, D., et al.: SelfPose: 3D egocentric pose estimation from a headset mounted camera. IEEE Trans. Pattern Anal. Mach. Intell. 45(6), 6794–6806 (2020). https://doi.org/10.1109/TPAMI.2020.3029700
Article Google Scholar
Tome, D., Peluse, P., Agapito, L., Badino, H.: xR-EgoPose: egocentric 3D human pose from an HMD camera. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 7728–7738 (2019)
Google Scholar
Wang, J., Liu, L., Xu, W., Sarkar, K., Luvizon, D., Theobalt, C.: Estimating egocentric 3D human pose in the wild with external weak supervision. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society (2022)
Google Scholar
Wang, J., Luvizon, D., Xu, W., Liu, L., Sarkar, K., Theobalt, C.: Scene-aware egocentric 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Google Scholar
Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided 3D pose estimation from videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp. 764–780. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0_45
Chapter Google Scholar
Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020)
Article Google Scholar
Winkler, A., Won, J., Ye, Y.: QuestSim: human motion tracking from sparse sensors with simulated avatars. In: SIGGRAPH Asia 2022 Conference Papers (2022)
Google Scholar
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481 (2018)
Google Scholar
Xu, W., et al.: Mo\(^{2}\)Cap\(^{2}\): real-time mobile 3D motion capture with a cap-mounted fisheye camera. IEEE Trans. Vis. Comput. Graph. 25(5), 2093–2101 (2019)
Article Google Scholar
Zhang, J., Tu, Z., Yang, J., Chen, Y., Yuan, J.: MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13232–13242 (2022)
Google Scholar
Zhang, Y., You, S., Gevers, T.: Automatic calibration of the fisheye camera for egocentric 3D human pose estimation from a single image. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1772–1781 (2021)
Google Scholar
Zhao, D., Wei, Z., Mahmud, J., Frahm, J.M.: EgoGlass: egocentric-view human pose estimation from an eyeglass frame. In: 2021 International Conference on 3D Vision (3DV), pp. 32–41 (2021)
Google Scholar
Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: MotionBERT: a unified perspective on learning human motion representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15085–15099 (2023)
Google Scholar

Download references

Acknowledgements

The authors would like to acknowledge the anonymous reviewers for their comments and corrections; Xuetong Sun and Fan Bu for their work on productizing body tracking; Steve Olsen, Kevin Harris, Steve Miller, Kaichen Sun, Ben Watson, Matthew Prasak, Daniel Frey, Gunnar Grismore, Andrew Anderson, Mark Hogan, and Neha Chachra for their help with data collection; David Dimond and Weijie Yu for their help with annotation; and Anastasia Tkach for machine learning development and experiments.

Author information

Authors and Affiliations

Meta Platforms Inc., Menlo Park, USA
Amy Zhao, Chengcheng Tang, Lezi Wang, Yijing Li, Mihika Dave, Lingling Tao, Christopher D. Twigg & Robert Y. Wang

Authors

Amy Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Chengcheng Tang
View author publications
You can also search for this author in PubMed Google Scholar
Lezi Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yijing Li
View author publications
You can also search for this author in PubMed Google Scholar
Mihika Dave
View author publications
You can also search for this author in PubMed Google Scholar
Lingling Tao
View author publications
You can also search for this author in PubMed Google Scholar
Christopher D. Twigg
View author publications
You can also search for this author in PubMed Google Scholar
Robert Y. Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Christopher D. Twigg .

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 9879 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhao, A. et al. (2025). EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15137. Springer, Cham. https://doi.org/10.1007/978-3-031-72986-7_22

Download citation

DOI: https://doi.org/10.1007/978-3-031-72986-7_22
Published: 02 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72985-0
Online ISBN: 978-3-031-72986-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Real-time body tracking in virtual reality using a Vive tracker

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 9879 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Real-time body tracking in virtual reality using a Vive tracker

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 9879 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation