Motion Capture from Internet Videos

Junting Dong¹²,
Qing Shuai¹²,
Yuanqing Zhang¹²,
Xian Liu¹²,
Xiaowei Zhou¹² &
…
Hujun Bao^12,13

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12347))

Included in the following conference series:

European Conference on Computer Vision

6843 Accesses

Abstract

Recent advances in image-based human pose estimation make it possible to capture 3D human motion from a single RGB video. However, the inherent depth ambiguity and self-occlusion in a single view prohibit the recovery of as high-quality motion as multi-view reconstruction. While multi-view videos are not common, the videos of a celebrity performing a specific action are usually abundant on the Internet. Even if these videos were recorded at different time instances, they would encode the same motion characteristics of the person. Therefore, we propose to capture human motion by jointly analyzing these Internet videos instead of using single videos separately. However, this new task poses many new challenges that cannot be addressed by existing methods, as the videos are unsynchronized, the camera viewpoints are unknown, the background scenes are different, and the human motions are not exactly the same among videos. To address these challenges, we propose a novel optimization-based framework and experimentally demonstrate its ability to recover much more precise and detailed motion from multiple videos, compared against monocular motion capture methods.

J. Dong and Q. Shuai—Equal contribution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: GBP 19.95; Price includes VAT (United Kingdom)

eBook: GBP 71.50; Price includes VAT (United Kingdom)

Softcover Book: GBP 89.99; Price includes VAT (United Kingdom)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: shape completion and animation of people. In: ACM transactions on graphics (TOG), pp. 408–416 (2005)
Google Scholar
Bo, L., Sminchisescu, C.: Twin gaussian processes for structured prediction. Int. J. Comput. Vis. 87, 28 (2010). https://doi.org/10.1007/s11263-008-0204-y
Article Google Scholar
Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 561–578. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_34
Chapter Google Scholar
Burenius, M., Sullivan, J., Carlsson, S.: 3D pictorial structures for multiple view articulated pose estimation. In: CVPR, pp. 3618–3625 (2013)
Google Scholar
Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., Sheikh, Y.A.: Open pose: real time multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
Google Scholar
Caspi, Y., Irani, M.: Spatio-temporal alignment of sequences. IEEE Trans. Pattern Anal. Mach. Intell. 24(11), 1409–1424 (2002)
Article Google Scholar
Chen, C.H., Ramanan, D.: 3D human pose estimation= 2D pose estimation + matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7035–7043 (2017)
Google Scholar
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded Pyramid Network for Multi-Person Pose Estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103–7112 (2018)
Google Scholar
Dong, J., Jiang, W., Huang, Q., Bao, H., Zhou, X.: Fast and robust multi-person 3D pose estimation from multiple views. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7792–7801 (2019)
Google Scholar
Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., Zisserman, A.: Temporal cycle-consistency learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1801–1810 (2019)
Google Scholar
Elhayek, A., et al.: Efficient convnet-based marker-less motion capture in general scenes with a low number of cameras. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3810–3818 (2015)
Google Scholar
Elhayek, A., Stoll, C., Hasler, N., Kim, K.I., Seidel, H.P., Theobalt, C.: Spatio-temporal motion tracking with unsynchronized cameras. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1870–1877. IEEE (2012)
Google Scholar
Elhayek, A., Stoll, C., Kim, K.I., Theobalt, C.: Outdoor human motion capture by simultaneous optimization of pose and camera parameters. Comput. Graph. Forum 34(6), 86–98 (2015)
Article Google Scholar
Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343 (2017)
Google Scholar
Gall, J., Rosenhahn, B., Brox, T., Seidel, H.P.: Optimization and filtering for human motion capture. Int. J. Comput. Vis. 87(1–2), 75 (2010). https://doi.org/10.1007/s11263-008-0173-1
Article Google Scholar
Guan, P., Weiss, A., Balan, A.O., Black, M.J.: Estimating human shape and pose from a single image. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1381–1388. IEEE (2009)
Google Scholar
Guler, R.A., Kokkinos, I.: Holopose: holistic 3D human reconstruction in-the-wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10884–10894 (2019)
Google Scholar
Hasler, N., Rosenhahn, B., Thormahlen, T., Wand, M., Gall, J., Seidel, H.P.: Markerless motion capture with unsynchronized moving cameras. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 224–231 IEEE (2009)
Google Scholar
Huang, Q.X., Guibas, L.: Consistent shape maps via semidefinite programming. Comput. Graph. Forum 32(5), 177–186 (2013)
Article Google Scholar
Huang, Y., et al.: Towards accurate marker-less human shape and pose estimation over time. In: 2017 international conference on 3D vision (3DV), pp. 421–430. IEEE (2017)
Google Scholar
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human 3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)
Article Google Scholar
Joo, H., Simon, T., Sheikh, Y.: Total capture: a 3D deformation model for tracking faces, hands, and bodies. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8320–8329 (2018)
Google Scholar
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122–7131 (2018)
Google Scholar
Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3D human dynamics from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5614–5623 (2019)
Google Scholar
Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M.J., Gehler, P.V.: Unite the people: Closing the loop between 3d and 2d human representations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6050-6059 (2017)
Google Scholar
Lee, C.S., Elgammal, A.: Coupled visual and kinematic manifold models for tracking. Int. J. Comput. Vis. 87(1–2), 118 (2010)
Article Google Scholar
Li, R., Tian, T.P., Sclaroff, S., Yang, M.H.: 3D human motion tracking with a coordinated mixture of factor analyzers. Int. J. Comput. Vis. 87(1–2), 170 (2010)
Article Google Scholar
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM transactions on graphics (TOG). 34(6), pp. 1–16 (2015)
Google Scholar
Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649 (2017)
Google Scholar
Moreno-Noguer, F.: 3D human pose estimation from a single image via distance matrix regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2823–2832 (2017)
Google Scholar
Omran, M., Lassner, C., Pons-Moll, G., Gehler, P., Schiele, B.: Neural body fitting: unifying deep learning and model based human pose and shape estimation. In 2018 international conference on 3D vision (3DV), pp. 484–494. IEEE (2018)
Google Scholar
Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10975–10985 (2019)
Google Scholar
Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7307–7316 (2018)
Google Scholar
Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Harvesting multiple views for marker-less 3D human pose annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6988–6997 (2017)
Google Scholar
Pavlakos, G., Zhu, L., Zhou, X., Daniilidis, K.: Learning to estimate 3D human pose and shape from a single color image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 459–468 (2018)
Google Scholar
Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7753-7762 (2019)
Google Scholar
Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. ACM Trans. Graph. (ToG) 36(6), 245 (2017)
Article Google Scholar
Saini, N., et al.: Markerless outdoor human motion capture using multiple autonomous micro aerial vehicles. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 823–832 (2019)
Google Scholar
Sermanet, P., et al.: Time-contrastive networks: self-supervised learning from video. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134-1141. IEEE (2018)
Google Scholar
Sigal, L., Balan, A., Black, M.J.: Combined discriminative and generative articulated pose and non-rigid shape estimation. In: Advances in neural information processing systems, pp. 1337–1344 (2008)
Google Scholar
Sigal, L., Isard, M., Haussecker, H., Black, M.J.: Loose-limbed people: estimating 3D human pose and motion using non-parametric belief propagation. Int. J. Comput. Vis. 98(1), 15–48 (2012)
Article MathSciNet Google Scholar
Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2602–2611 (2017)
Google Scholar
Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 529–545 (2018)
Google Scholar
Tekin, B., Márquez-Neila, P., Salzmann, M., Fua, P.: Learning to fuse 2D and 3D image cues for monocular body pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3941–3950 (2017)
Google Scholar
Tome, D., Russell, C., Agapito, L.: Lifting from the deep: convolutional 3D pose estimation from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2500–2509 (2017)
Google Scholar
Tuytelaars, T., Van Gool, L.: Synchronizing video sequences. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004, pp. 1–1. IEEE (2004)
Google Scholar
Ukrainitz, Y., Irani, M.: Aligning sequences and actions by maximizing space-time correlations. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 538–550. Springer, Heidelberg (2006). https://doi.org/10.1007/11744078_42
Chapter Google Scholar
Wang, O., Schroers, C., Zimmer, H., Gross, M., Sorkine-Hornung, A.: Videosnapping: interactive synchronization of multiple videos. ACM Trans. Graph. (TOG) 33(4), 1–10 (2014)
Google Scholar
Wang, Y., Liu, Y., Tong, X., Dai, Q., Tan, P.: Outdoor markerless motion capture with sparse handheld video cameras. IEEE Trans. Visual. Comput. Graph. 24(5), 1856–1866 (2017)
Article Google Scholar
Wolf, L., Zomet, A.: Wide baseline matching between unsynchronized video sequences. Int. J. Comput. Vis. 68(1), 43–52 (2006)
Article Google Scholar
Xiang, D., Joo, H., Sheikh, Y.: Monocular total capture: posing face, body, and hands in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10965–10974 (2019)
Google Scholar
Xu, X., Dunn, E.: Discrete laplace operator estimation for dynamic 3D reconstruction. arXiv preprint arXiv:1908.11044 (2019)
Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3D pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2148–2157 (2018)
Google Scholar
Zanfir, A., Marinoiu, E., Zanfir, M., Popa, A.I., Sminchisescu, C.: Deep network for the integrated 3D sensing of multiple people in natural images. In: Advances in Neural Information Processing Systems, pp. 8410–8419 (2018)
Google Scholar
Zheng, E., Ji, D., Dunn, E., Frahm, J.M.: Sparse dynamic 3D reconstruction from unsynchronized videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4435–4443 (2015)
Google Scholar
Zhou, X., Zhu, M., Daniilidis, K.: Multi-image matching via fast alternating minimization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4032–4040 (2015)
Google Scholar
Zhou, X., Zhu, M., Leonardos, S., Derpanis, K.G., Daniilidis, K.: Sparseness meets deepness: 3d human pose estimation from monocular video. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4966-4975 (2016)
Google Scholar
Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 398–407 (2017)
Google Scholar

Download references

Acknowledgement

The authors would like to acknowledge support from NSFC (No. 61806176) and Fundamental Research Funds for the Central Universities (2019QNA5022).

Author information

Authors and Affiliations

State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China
Junting Dong, Qing Shuai, Yuanqing Zhang, Xian Liu, Xiaowei Zhou & Hujun Bao
Zhejiang Lab, Hangzhou, China
Hujun Bao

Authors

Junting Dong
View author publications
You can also search for this author in PubMed Google Scholar
Qing Shuai
View author publications
You can also search for this author in PubMed Google Scholar
Yuanqing Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Xian Liu
View author publications
You can also search for this author in PubMed Google Scholar
Xiaowei Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Hujun Bao
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hujun Bao .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Dong, J., Shuai, Q., Zhang, Y., Liu, X., Zhou, X., Bao, H. (2020). Motion Capture from Internet Videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12347. Springer, Cham. https://doi.org/10.1007/978-3-030-58536-5_13

Download citation

DOI: https://doi.org/10.1007/978-3-030-58536-5_13
Published: 03 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58535-8
Online ISBN: 978-3-030-58536-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics