Abstract
Accurate tracking of a user’s body pose while wearing a virtual reality (VR), augmented reality (AR), or mixed reality (MR) headset is a prerequisite for authentic self-expression, natural social presence, and intuitive user interfaces. Existing body tracking approaches on VR/AR devices are either under-constrained, e.g., attempting to infer full body pose from only headset and controller pose, or require impractical hardware setups that place cameras far from a user’s face to improve body visibility. In this paper, we present the first controller-less egocentric body tracking solution that runs on an actual VR device using the same cameras that are used for SLAM tracking. We propose a novel egocentric tracking architecture that models the temporal history of body motion using multi-view latent features. Furthermore, we release the first large-scale real-image dataset for egocentric body tracking, EgoBody3M, with a realistic VR headset configuration and diverse subjects and motions. Benchmarks on the dataset show that our approach outperforms other state-of-the-art methods in both accuracy and smoothness of the resulting motion. We perform ablation studies on our model choices and demonstrate the method running in real time on a VR headset. Our dataset, with more than 30 h of recordings and 3 million frames, will be made publicly available.
A. Zhao and C. Tang—Equal contribution.
Acknowledgements
The authors would like to acknowledge the anonymous reviewers for their comments and corrections; Xuetong Sun and Fan Bu for their work on productizing body tracking; Steve Olsen, Kevin Harris, Steve Miller, Kaichen Sun, Ben Watson, Matthew Prasak, Daniel Frey, Gunnar Grismore, Andrew Anderson, Mark Hogan, and Neha Chachra for their help with data collection; David Dimond and Weijie Yu for their help with annotation; and Anastasia Tkach for machine learning development and experiments.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhao, A. et al. (2025). EgoBody3M: Egocentric Body Tracking on a VR Headset using a Diverse Dataset. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15137. Springer, Cham. https://doi.org/10.1007/978-3-031-72986-7_22
DOI: https://doi.org/10.1007/978-3-031-72986-7_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72985-0
Online ISBN: 978-3-031-72986-7
eBook Packages: Computer Science, Computer Science (R0)