DOI: 10.1145/3503161.3547796
research-article

A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion

Published: 10 October 2022

Abstract

Multi-person motion capture can be challenging due to ambiguities caused by severe occlusion, fast body movement, and complex interactions. Existing frameworks build on 2D pose estimation and triangulate to 3D coordinates by reasoning about appearance, trajectory, and geometric consistencies across multi-camera observations. However, 2D joint detections are usually incomplete and suffer from wrong identity assignments due to limited observation angles, which leads to noisy 3D triangulation results. To overcome this issue, we propose to explore the short-range autoregressive characteristics of skeletal motion using a transformer. First, we propose an adaptive, identity-aware triangulation module that reconstructs 3D joints and identifies the missing joints for each identity. To generate complete 3D skeletal motion, we then propose a Dual-Masked Auto-Encoder (D-MAE), which encodes joint status with both skeletal-structural and temporal position encodings for trajectory completion. D-MAE's flexible masking and encoding mechanism enables arbitrary skeleton definitions to be conveniently deployed under the same framework. To demonstrate the proposed model's capability in handling severe data-loss scenarios, we contribute a high-accuracy and challenging motion capture dataset of multi-person interactions with severe occlusion. Evaluations on both a public benchmark and our new dataset demonstrate the efficiency of the proposed model, as well as its advantage over other state-of-the-art methods.
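
Although this page does not include code, the mechanism the abstract describes can be illustrated with a minimal PyTorch sketch: incomplete 3D joint trajectories are tokenized, missing joints are replaced by a learnable mask token, and every token receives both a skeletal-structural (joint-index) and a temporal (frame-index) position embedding before a transformer encoder regresses the completed coordinates. All class, parameter, and dimension names below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the dual-masked skeletal token completion idea.
# Names, sizes, and hyper-parameters are hypothetical, not from the paper.
import torch
import torch.nn as nn

class SkeletalTokenCompleter(nn.Module):
    def __init__(self, num_joints=17, num_frames=30, dim=256, depth=4, heads=8):
        super().__init__()
        self.token_proj = nn.Linear(3, dim)              # 3D joint coordinate -> token
        self.mask_token = nn.Parameter(torch.zeros(dim)) # placeholder for missing joints
        self.joint_pos = nn.Embedding(num_joints, dim)   # skeletal-structural position encoding
        self.time_pos = nn.Embedding(num_frames, dim)    # temporal position encoding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3)                    # regress completed 3D coordinates

    def forward(self, joints, missing):
        # joints: (B, T, J, 3) noisy/incomplete triangulation; missing: (B, T, J) bool
        B, T, J, _ = joints.shape
        tokens = self.token_proj(joints)                               # (B, T, J, D)
        tokens = torch.where(missing[..., None], self.mask_token, tokens)
        jp = self.joint_pos(torch.arange(J, device=joints.device))     # (J, D)
        tp = self.time_pos(torch.arange(T, device=joints.device))      # (T, D)
        tokens = tokens + jp[None, None] + tp[None, :, None]           # dual position encoding
        out = self.encoder(tokens.reshape(B, T * J, -1))               # attend over space and time
        return self.head(out).reshape(B, T, J, 3)

# Usage: mark ~30% of joints as missing and let the encoder fill them in.
model = SkeletalTokenCompleter()
clip = torch.randn(2, 30, 17, 3)
missing = torch.rand(2, 30, 17) < 0.3
completed = model(clip, missing)   # (2, 30, 17, 3)
```

Because the mask token and the two position embeddings are independent of any particular joint set, the same sketch accommodates different skeleton definitions by changing only the number of joint embeddings.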

Supplementary Material

MP4 File (MM22-fp251.mp4)
In this presentation video, we first introduce the background of the multi-view multi-person motion capture task. We then present our four main contributions: 1) a dual-masked auto-encoder module for motion completion, 2) an adaptive triangulation module for motion reconstruction, 3) a motion capture framework that achieves state-of-the-art performance, and 4) a large-scale, manually annotated multi-view multi-person mo-cap dataset, BU-Mocap. Finally, we report the evaluation results.
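
The adaptive triangulation step mentioned above is not spelled out on this page; as a rough, hypothetical sketch of the general idea, a confidence-weighted direct linear transform (DLT) can down-weight or discard unreliable 2D detections per view and flag joints with too few confident views as missing, leaving them for the completion stage. The function name, threshold, and weighting scheme below are assumptions for illustration only.

```python
# Hypothetical confidence-weighted triangulation of a single joint (not the paper's code).
import numpy as np

def triangulate_joint(proj_mats, points_2d, confidences, conf_thresh=0.3):
    """proj_mats: (V, 3, 4) camera projection matrices; points_2d: (V, 2) detections;
    confidences: (V,) per-view detection scores. Returns a 3D point, or None when
    fewer than two confident views remain (the joint is then treated as missing)."""
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, confidences):
        if w < conf_thresh:                    # skip unreliable detections
            continue
        rows.append(w * (u * P[2] - P[0]))     # weighted DLT constraints
        rows.append(w * (v * P[2] - P[1]))
    if len(rows) < 4:                          # need at least two confident views
        return None
    _, _, vh = np.linalg.svd(np.stack(rows))
    X = vh[-1]
    return X[:3] / X[3]                        # dehomogenize
```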


Cited By

  • (2024) Self-Supervised Time Series Representation Learning via Cross Reconstruction Transformer. IEEE Transactions on Neural Networks and Learning Systems 35, 11, 16129-16138. https://doi.org/10.1109/TNNLS.2023.3292066
  • (2024) Motion Part-Level Interpolation and Manipulation over Automatic Symbolic Labanotation Annotation. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. https://doi.org/10.1109/IJCNN60899.2024.10650779
  • (2024) Exploring Latent Cross-Channel Embedding for Accurate 3D Human Pose Reconstruction in a Diffusion Framework. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7870-7874. https://doi.org/10.1109/ICASSP48485.2024.10448487
  • (2024) ReChoreoNet: Repertoire-based Dance Re-choreography with Music-conditioned Temporal and Style Clues. Machine Intelligence Research 21, 4, 771-781. https://doi.org/10.1007/s11633-023-1478-9

        Published In

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 10 October 2022


        Author Tags

        1. 3D human pose estimation
        2. masked auto-encoder
        3. motion capture
        4. spatial-temporal encoding
        5. transformer

        Qualifiers

        • Research-article

        Funding Sources

        • Research Grants Council of Hong Kong

        Conference

        MM '22

        Acceptance Rates

        Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

