DOI: 10.1145/3664647.3681654
research-article

SATPose: Improving Monocular 3D Pose Estimation with Spatial-aware Ground Tactility

Published: 28 October 2024

Abstract

Estimating 3D human poses from monocular images is an important research area with many practical applications. However, the depth ambiguity of 2D solutions limits their accuracy in actions where occlusion exists or where slight centroid shifts can result in significant 3D pose variations. In this paper, we introduce a novel multimodal approach to mitigate the depth ambiguity inherent in monocular solutions by integrating spatial-aware pressure information. We first establish a data collection system with a pressure mat and a monocular camera, and construct a large-scale multimodal human activity dataset comprising over 600,000 frames of motion data. Utilizing this dataset, we propose a pressure image reconstruction network to extract pressure priors from monocular images. Subsequently, we introduce a Transformer-based multimodal pose estimation network to combine pressure priors with monocular images, achieving a world mean per joint position error of 51.6 mm and outperforming state-of-the-art methods. Extensive experiments demonstrate the effectiveness of our multimodal 3D human pose estimation method across various actions and joints, highlighting the significance of spatial-aware pressure in improving the accuracy of monocular-vision-based methods. Our dataset is available at: https://github.com/LishuangZhan/SATPose.
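
For readers unfamiliar with the reported metric, the following is a minimal sketch of the standard mean per joint position error (MPJPE): the Euclidean distance between each predicted joint and its ground-truth joint, averaged over joints and then over frames. The abstract does not describe the evaluation protocol in detail, so this sketch assumes that predictions and ground truth are already expressed in the same world coordinate frame, in millimetres, with no root alignment. The function names and the toy pose are illustrative and are not taken from the paper's code.

```python
import math
from typing import Sequence

Joint = Sequence[float]   # (x, y, z) in millimetres, world frame (assumed)
Pose = Sequence[Joint]    # one entry per joint


def mpjpe(pred: Pose, target: Pose) -> float:
    """Mean Euclidean distance between corresponding joints of one pose."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("poses must be non-empty and have the same joint count")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def sequence_mpjpe(preds: Sequence[Pose], targets: Sequence[Pose]) -> float:
    """Average the per-frame MPJPE over all frames of a sequence."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("sequences must be non-empty and have the same length")
    return sum(mpjpe(p, t) for p, t in zip(preds, targets)) / len(preds)


# Toy two-joint pose: the first joint is off by 50 mm, the second is exact.
target_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 1000.0)]
pred_pose = [(30.0, 40.0, 0.0), (0.0, 0.0, 1000.0)]

frame_error = mpjpe(pred_pose, target_pose)
sequence_error = sequence_mpjpe([pred_pose, target_pose], [target_pose, target_pose])
```

In this toy case the per-joint errors are 50 mm and 0 mm, so the frame error is their mean. A figure such as 51.6 mm would be the same kind of average taken over every joint and frame of a test set.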

Supplemental Material

MP4 File - 5353#-video.mp4
Video presentation about multimodal 3D human pose estimation with monocular vision and ground pressure.


Information & Contributors

Information

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multimodal 3d human pose estimation
  2. multimodal human activity dataset
  3. pressure image reconstruction
  4. pressure sensor

Qualifiers

  • Research-article

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
